Abstract



With the increasing size of today's data sets, finding the right parameter configuration in 
model selection via cross-validation can be an extremely time-consuming task. In this pa- 
per we propose an improved cross-validation procedure which uses non-parametric testing 
coupled with sequential analysis to determine the best parameter set on linearly increas- 
ing subsets of the data. By eliminating underperforming candidates quickly and keeping 
promising candidates as long as possible, the method speeds up the computation while 
preserving the power of the full cross-validation. Theoretical considerations underline the 
statistical power of our procedure. The experimental evaluation shows that our method 
reduces the computation time by a factor of up to 120 compared to a full cross-validation 
with a negligible impact on the accuracy. 

Keywords: Cross-Validation, Statistical Testing, Nonparametric Methods
1. Introduction 

Cross-validation is a de-facto standard in applied machine learning to tune parameter con- 
figurations of machine learning methods in supervised learning settings (see Mosteller and 
Tukey 1968; Stone 1974; Geisser 1975 and also Arlot et al. 2010 for a recent and exten- 
sive review of the method). Part of the data is held back and used as a test set to get a 
less biased estimate of the true generalization error. Cross-validation is computationally 
quite demanding, though. Doing a full grid search on all possible combinations of param- 
eter candidates quickly takes a lot of time, even if one exploits the obvious potential for 
parallelization. 

Therefore, cross-validation is seldom executed in full in practice, but different heuristics 
are usually employed to speed up the computation. For example, instead of using the full 
grid, local search heuristics may be used to find local minima in the test error (see for 
instance Kohavi and John 1995; Bengio 2000; Keerthi et al. 2006). However, in general, as 
with all local search methods, no guarantees can be given as to the quality of the found 
local minima. Another frequently used heuristic is to perform the cross-validation on a 
subset of the data, and then train on the full data set to get the most accurate predictions. 
The problem here is to find the right size of the subset: If the subset is too small and 
cannot reflect the true complexity of the learning problem, the configurations selected by 
cross-validation will lead to underfitted models. On the other hand, with a subset that is
too large, the cross-validation will take longer to finish.

Applying those kinds of heuristics requires both an experienced practitioner and famil- 
iarity with the data set. However, the effects at play in the subset approach are more 
manageable: as we will discuss in more depth below, given increasing subsets of the data, 
the minimizer of the test error will converge, often much earlier than the test error it- 
self. Thus, using subsets in a systematic way opens up a promising way to speed up the 
model selection process, since training models on smaller subsets of the data is much more 
time-efficient. During this process care has to be taken when an increase in available data 
suddenly reveals more structure in the data, leading to a change of the optimal parameter 
configuration. Still, as we will discuss in more depth, there are ways to guard against such 
change points, making the heuristic of taking subsets a more promising candidate for an 
automated procedure. 

In this paper we will propose a method which speeds up cross-validation by considering 
subsets of increasing size. Removing clearly underperforming parameter configurations
along the way leads to a substantial saving in total computation time, as sketched in
Figure 1. In order to account for possible change points, sequential testing (Wald, 1947)
is adapted to control a safety zone, roughly speaking, a certain number of allowed failures 
for a parameter configuration. At the same time this framework gives statistical guarantees 
for dropping clearly underperforming configurations. Finally, we add a stopping criterion 
to watch for early convergence of the process to further speed up the computation. 

In the following, we will first discuss the effects of taking subsets on learners and
cross-validation (Section 2), present our method Fast Cross-Validation via Sequential Testing
(CVST, Section 3), discuss the theoretical properties of the method (Section 4) and finally
evaluate our method on synthetic and real-world data sets in Section 5. Section 6 gives an
overview of related approaches and possible extensions and Section 7 concludes the paper.
The impatient practitioner may skip some theoretical treatments and focus on the self- 
contained Section 3 describing the CVST algorithm and its evaluation in Section 5. To ease 
the reading process we collected our notational conventions in Table 1. 

2. Cross-Validation on Subsets

Our approach is based on taking subsets of the data to speed up cross-validation. The main 
question is therefore whether we can reliably estimate the best parameter configuration 
already from subsets of the data. In this section we wish to study this question from 
a theoretical point of view. Unfortunately, an accurate formal discussion of the effects 
involved is not possible because existing error bounds are too loose due to their worst-case 
nature. Nevertheless, we will first formally prove that the estimate converges under mild 
assumptions. Then, we will look at actual errors in numerical simulations to get a clearer 
picture of the effects involved and explain why one can in general expect fast convergence 
of our fast cross-validation procedure. 

Figure 1: Performance of a 5-fold cross-validation (CV, left) and fast cross-validation via
sequential testing (CVST, right): While the CV has to calculate the model for
each configuration (here: $\sigma$ of a Gaussian kernel) on the full data set, the CVST
algorithm uses increasing subsets of the data and drops significantly underperforming
configurations in each step (upper panels), resulting in a drastic decrease
of total calculation time (sum of colored area in lower panels).

Let us first introduce some notation: Assume that our $N$ training data points $d_i$ are given
by input/output pairs $d_i = (X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$ drawn i.i.d. from some probability distribution
$P$ on $\mathcal{X} \times \mathcal{Y}$. As usual, we also assume that we have some loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$
given such that the overall error or expected risk of a predictor $g : \mathcal{X} \to \mathcal{Y}$ is given by
$R(g) = E[\ell(g(X), Y)]$ where $(X, Y) \sim P$.
For some set of possible parameter configurations $C$, let $g_n(c)$ be the predictor learned
for parameter $c \in C$ from the first $n$ training examples. Cross-validation basically tries to
identify the best parameter $c$ for a given training set size $n$ which minimizes the expected
risk.

We are considering training set sizes, so we are interested in the convergence of the
sequence of functions $e : \mathbb{N} \to (C \to \mathbb{R})$ defined by

$$e_n(c) = R(g_n(c)). \qquad (1)$$

We are thus interested in how the minimum of the function $e_m(c)$ at some subset size $m$
relates to that of $e_n(c)$. This question is linked to the asymptotic behavior of $e_n(c)$ itself,
because if the minimum of $e_n(c)$ converges, so will $e_m(c)$ to $e_n(c)$ eventually.
Symbol                                                    Description

$d_i = (X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$     Data points
$N$                                                       Total data set size
$g : \mathcal{X} \to \mathcal{Y}$                         Learned predictor
$\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$    Loss function
$R(g) = E[\ell(g(X), Y)]$                                 Risk of predictor $g$
$c$                                                       Configuration of learner
$C$                                                       Set of examined configurations
$g_n(c) : \mathcal{X} \to \mathcal{Y}$                    Predictor learned on $n$ data points for configuration $c$
$e_n(c) = R(g_n(c))$                                      Risk of learner on $n$ data points
$c^*$                                                     Overall best configuration
$c_n^*$                                                   Best configuration for models based on $n$ data points
$s$                                                       Current step of CVST procedure
$S$                                                       Total number of steps
$\Delta = N/S$                                            Increment of model size
$P_p$                                                     Pointwise performance matrix
$P_S$                                                     Overall performance matrix of dimension $|C| \times S$
$T_S$                                                     Trace matrix of dimension $|C| \times S$
$w_{stop}$                                                Size of early stopping window
$\alpha, \alpha_l, \beta_l$                               Significance levels
$\pi$                                                     Success probability of a binomial variable

Table 1: List of symbols



For the sake of simplicity, we will not consider the test error on a finite set, but directly 
consider the expected error. In principle, we would have to also consider a limit in the size of 
the test set, which is possible, but would lead to more complex formulas without adding to 
the discussion. To make this argument more precise: Actually, we would have to compare 
the error $e_n(c)$ against another empirical error $e'_m(c)$ defined on an independent sample.
Instead, we compare $e_n(c)$ against $e(c)$, the expected error of parameter configuration $c$.
Technically, this would mean that we have to consider $|e_n(c) - e'_m(c)| \le |e_n(c) - e(c)| +
|e'_m(c) - e(c)|$, and we get essentially the same results with an additional limit process of
$m \to \infty$, that is, considering the limit as the test set size also tends to infinity. So for the
sake of simplicity, we have dropped this detail. 

Now in general, we cannot expect $e_n$ to converge at all. For example, it might be that
the model parameter is encoded in a way which needs to be scaled with the sample size,
i.e., it might be that a value $c$ at sample size $n$ corresponds to $f(n)c$ for some function $f(n)$,
or more complex settings. Under such conditions, a parameter configuration $c$ might work
well at a subset size m but become a suboptimal choice for larger sample sizes. 

We will therefore assume that

$$\text{for each fixed } c, \quad \lim_{n \to \infty} e_n(c) \text{ exists.} \qquad (2)$$

Fortunately, this condition holds for a number of standard methods, including feed-forward
neural networks with one hidden layer and sigmoid activation functions, and kernel machines
like kernel ridge regression and support vector machines (for more details, consult
Appendix A).

Now, using standard techniques, it is straightforward to prove the following result (see 
Appendix B for the proof). 

Theorem 1 Let $C$ be a finite set and assume that for each fixed $c$, $e_n(c) \to e(c)$ in
probability. Then, the following holds:

1. (Uniform convergence over $C$) As $n \to \infty$,

$$\max_{c \in C} |e_n(c) - e(c)| \to 0$$

in probability.

2. (Convergence of the minimum) Let $c_n^*$ be such that $e_n(c_n^*) = \min_{c \in C} e_n(c)$, and
$c^*$ such that $e(c^*) = \min_{c \in C} e(c)$. Then,

$$e(c_n^*) - e(c^*) \le 2 \max_{c \in C} |e_n(c) - e(c)|.$$

Moreover,

$$e(c_n^*) - e(c^*) \to 0 \quad \text{in probability.}$$

3. (Evaluating on subsets of the data) For each $\varepsilon, \delta > 0$, there exists a number $l$
such that for all $n \ge l$ and $m$ with $l \le m \le n$,

$$P\{e_n(c_m^*) - e_n(c_n^*) > \varepsilon\} \le \delta.$$

For the sake of simplicity, we have assumed that C is finite. These results can likely be 
extended to continuous parameter spaces with significant technical overhead. Theorem 1 
proves that, asymptotically, we can expect to get good estimates for the right choice of 
parameter configurations at training size n on a subset of size m. This result assumes 
that parameter configurations are encoded in a way which is independent of the size of the 
training set and hinges on uniform convergence over all possible parameter choices. 
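
One standard way to see the inequality in item 2 of Theorem 1 is sketched below. This is not necessarily the argument used in Appendix B; it only inserts the finite-sample quantities and uses that $c_n^*$ minimizes $e_n$, so the middle bracket cannot be positive.

```latex
\begin{align*}
e(c_n^*) - e(c^*)
  &= \bigl[e(c_n^*) - e_n(c_n^*)\bigr]
   + \underbrace{\bigl[e_n(c_n^*) - e_n(c^*)\bigr]}_{\le 0}
   + \bigl[e_n(c^*) - e(c^*)\bigr] \\
  &\le |e(c_n^*) - e_n(c_n^*)| + |e_n(c^*) - e(c^*)| \\
  &\le 2 \max_{c \in C} |e_n(c) - e(c)|.
\end{align*}
```

Combined with the uniform convergence in item 1, the right-hand side tends to zero in probability, which yields the convergence of $e(c_n^*) - e(c^*)$.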

Now how well does this result describe the practical findings? Figure 2(a) shows the test 
errors for a typical example. We train a support vector regression model (SVR) on subsets 
of the full training set consisting of 500 data points. The data set is the noisy sine data set 
introduced in Section 5.1. Model parameters are the kernel width $\sigma$ of the Gaussian kernel
used, and the regularization parameter, where the values shown are already optimized over 
the regularization parameter for the sake of simplicity. 

We see that the minimum converges rather quickly, first to the plateau of $\log(\sigma) \in
[-1.5, -0.3]$ approximately, and then towards the lower one at $[-2.5, -1.7]$, which is also
the optimal one at training set size n = 500. We see that uniform convergence is not the 
main driving force. In fact, the errors for small kernel widths are still very far apart even 
when the minimum is already converged. 
Figure 2: Test error of an SVR model on the noisy sine data set introduced in Section 5.1.
We can observe a shift of the optimal $\sigma$ of the Gaussian kernel to the fine-grained
structure of the problem, if we have seen enough data. In Figure 2(b), the approximation
error is indicated by the black solid line, and the estimation error by the
black dashed line. The asymptotic approximation error is plotted as the blue
dashed line. One can see that uniform approximation of the estimation error is
not the main driving force; instead, the decay of the approximation error with
smaller kernel widths together with an increase of the estimation error at small
kernel widths makes sure that the minimum converges quickly.



In the following, it is helpful to continue the discussion within the empirical risk min- 
imization framework. We assume that the learner is trained by picking the model which 
minimizes the empirical risk over some hypothesis set $\mathcal{G}$. In this setting, one can write the
difference between the expected risk of the learned predictor $R(g_n)$ and the Bayes risk $R^*$
as follows (see also Section 12.1 in Devroye et al. 1996 or Section 2.4.3 in Mohri et al. 2012):

$$R(g_n) - R^* = \underbrace{\Big(R(g_n) - \inf_{g \in \mathcal{G}} R(g)\Big)}_{\text{estimation error}} + \underbrace{\Big(\inf_{g \in \mathcal{G}} R(g) - R^*\Big)}_{\text{approximation error}}.$$

The estimation error measures how far the chosen model is from the one which would 
be asymptotically optimal, while the approximation error measures the difference between 
the best possible model in the hypothesis class and the true function. 

Using this decomposition, we can interpret the figure as follows (see Figure 2(b)): The
kernel width controls the approximation error. For $\log(\sigma) > -1.8$, the resulting hypothesis
class is too coarse to represent the function under consideration. As the kernel width
decreases, the approximation error becomes smaller until it reaches the level of the Bayes
risk as indicated by the dashed blue line. For even larger
training set sizes, we can assume that it will stay on this level even for smaller kernel sizes.

The difference between the blue line and the upper lines shows the estimation error. The 
estimation error has been extensively studied in statistical learning theory and is known to 
be linked to different notions of complexity like VC-dimension (Vapnik, 1998), fat-shattering 
dimension (Bartlett et al., 1996), or the norm in the reproducing kernel Hilbert space
(RKHS) (Evgeniou and Pontil, 1999). A typical result shows that the estimation error can 
be bounded by terms of the form 



where $d(\mathcal{G})$ is some notion of complexity of the underlying hypothesis class, and $R_n(g_n)$ is
the empirical risk on the training set. For our figure, this means that we can expect the 
estimation error to become larger for smaller kernel widths. 

So basically, if we order the parameter configurations according to their complexity, we 
make three observations: 

1. For parameter configurations with small complexity (that is, large kernel width), the 
approximation error will be high, but the estimation error will be small. 

2. For parameter configurations with high complexity, the approximation error will be 
small, even optimal, but the estimation error will be large. 

3. Also, as we see in Figure 2(b), the approximation error seems to decrease faster with 
increasing complexity than the estimation error increases. 

In combination, the estimates at smaller training set sizes tend to underestimate the true 
model complexity, but as the approximation error quickly decreases, the minimum also 
converges to the true one. The fact that the estimation error is larger for more complex
models acts as a guard against choosing overly complex models.

Unfortunately, existing theoretical results are not able to bound the error sufficiently 
tightly to make these arguments more exact. In particular, the speed of convergence of
the minimum hinges on a tight lower bound on the approximation error, and a realistic upper
bound on the estimation error. Approximation errors have been studied for example in the 
papers by Smale and Zhou (2003) and Steinwart and Scovel (2007), but the papers only 
prove upper bounds, and the rates are also worst-case rates which are likely not close enough 
to the true errors. On the other hand, the mechanisms which lead to fast convergence of 
the minimum are plausible when looking at concrete examples as we did above. Therefore, 
we will assume in the following that the location of the best parameter configuration might 
initially change but then become more or less stable quickly. We will use sequential testing 
to introduce a safety zone which ensures that our method is robust against these initial 
changes. 

3. Fast Cross-Validation via Sequential Testing (CVST)

Recall from Section 2 that we have a data set consisting of $N$ data points $d_i = (X_i, Y_i) \in
\mathcal{X} \times \mathcal{Y}$ which we assume to be drawn i.i.d. from $P$. We have a learning algorithm which
depends on several parameters collected in a configuration $c \in C$. The goal is to select the
configuration $c^*$ out of all possible configurations $C$ such that the learned predictor $g$ has
the best generalization error with respect to some loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$.







Krueger, Panknin and Braun 



Our approach attempts to speed up the model selection process by learning just on
subsamples of size $n := s\Delta$ for $1 \le s \le S$, starting with the full set of configurations
and eliminating clearly underperforming configurations at each step $s$ based on the
performances observed in steps 1 to $s$. The main loop of Algorithm 1 executes the
following parts at each step $s$:

① The procedure learns a model on the first $n$ data points for the remaining configurations
and stores the test errors on the remaining $N - n$ data points in the pointwise
performance matrix $P_p$ (Lines 10-14). This matrix $P_p$ is used on Lines 15-16 to estimate
the top performing configurations via robust testing and saves the outcome as
a binary "top or flop" scheme accordingly.

② The procedure drops significant loser configurations along the way (Lines 17-19) using
tests from the sequential analysis framework.

③ Applying robust, distribution-free testing techniques allows for an early stopping of
the procedure, when we have seen enough data for a stable parameter estimation
(Line 20).
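
To make the subset schedule behind these steps concrete, the following minimal sketch lists, for each step, how many points are used for training and how many remain for testing. The function name `subset_schedule` and the example values (500 points, 10 steps) are illustrative assumptions, and the sketch assumes that the number of steps divides the data set size.

```python
def subset_schedule(N, S):
    """Return (step, n_train, n_test) for every step s = 1, ..., S.

    Assumes S divides N; otherwise the remainder is simply ignored here.
    """
    delta = N // S  # subset increment
    return [(s, s * delta, N - s * delta) for s in range(1, S + 1)]


if __name__ == "__main__":
    for step, n_train, n_test in subset_schedule(500, 10):
        print(f"step {step:2d}: train on first {n_train:3d} points, "
              f"test on remaining {n_test:3d}")
```

Early steps train on little data but evaluate on many held-out points, and later steps do the opposite. Read literally, the final step leaves no held-out points; how that case is treated is not spelled out in this section, so the sketch only reports it.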

In the following we will discuss the individual steps in the algorithm. A conceptual 
overview of one iteration of the procedure is depicted in Figure 3 for reference. Additionally, 
we have released a software package on CRAN named CVST which is publicly available via 
all official CRAN repositories and also via github (https://github.com/tammok/CVST). 
This package contains the CVST procedure and all learners used in Section 5 ready for use. 

3.1 Robust Transformation of Test Errors 

To robustly transform the performance of configurations into the binary information whether 
it is among the top-performing configurations or turns out to be a flop, we rely on distribution- 
free tests. The basic idea is to calculate the pointwise performance of a given configuration 
on data points not used for the learning of the model and find the group of best configura- 
tions, which show a similar behavior. 

We exemplify this procedure by the situation depicted in Figure 3 with $K$ remaining
configurations $c_1, c_2, \ldots, c_K$ which are ordered according to their mean performances (i.e.,
sorted ascending with regard to their expected loss). We now want to find the largest
index $k \le K$, such that the configurations $c_1, c_2, \ldots, c_k$ all show the same behavior on the
remaining data points $d_{n+1}, d_{n+2}, \ldots, d_N$ not used in the current model learning process.

The rationale behind our comparison procedure is threefold: First, by ordering the
configurations by the mean performances we start with the comparison of the currently best
performing configurations. Second, by using the first $n := s\Delta$ data points for the model
building and the remaining $N - n$ data points for the estimation of the average performance
of each configuration we compensate the error introduced by learning on smaller subsets of
the data by better error estimates on more data points. That is, for small $s$ we will learn the
model on relatively small subsets of the overall available data while we estimate the test error
on relatively large portions of the data and vice versa. Third, by applying test procedures
directly on the error estimates of individual data points we exploit a further robustifying
pooling effect: if we have outliers in the testing data, all models will be affected by this and
therefore the overall testing result will not be affected.

Figure 3: One step of CVST. Shown is the situation in step $s = 10$. ① A model based on
the first $n$ data points is learned for each configuration ($c_1$ to $c_K$). Test errors
are calculated on the remaining data ($d_{n+1}$ to $d_N$) and transformed into a binary
performance indicator via robust testing. ② Traces of configurations are filtered
via sequential analysis ($c_{K-1}$ and $c_K$ are dropped). ③ The procedure checks
whether the remaining configurations performed equally well in the past and stops
if this is the case. See Appendix F for a complete example run.

To find the top performing configurations for step $s$ we look at the outcome of the
learned model for each configuration, i.e., we successively take the rows of the pointwise
performance matrix $P_p$ into account and apply either the Friedman test (Friedman, 1937) for
regression experiments or Cochran's Q test (Cochran, 1950) for classification experiments
to see whether we observe statistically significant differences between configurations
(see Appendix G for a summary of these tests).

Algorithm 1 CVST Main Loop
 1: function CVST($d_1, \ldots, d_N, S, C, \alpha, \beta_l, \alpha_l, w_{stop}$)
 2:     $\Delta \leftarrow N/S$                                        ▷ Initialize subset increment
 3:     $n \leftarrow \Delta$                                          ▷ Initialize model size
 4:     test $\leftarrow$ getTest($S, \beta_l, \alpha_l$)              ▷ Get sequential test
 5:     $\forall s \in \{1, \ldots, S\}, c \in C : T_S[c, s] \leftarrow 0$
 6:     $\forall s \in \{1, \ldots, S\}, c \in C : P_S[c, s] \leftarrow$ NA
 7:     $\forall c \in C$ : isActive[$c$] $\leftarrow$ true
 8:     for $s \leftarrow 1$ to $S$ do
 9:         $\forall i \in \{1, \ldots, N - n\}, c \in C : P_p[c, i] \leftarrow$ NA
10:         for $c \in C$ do
11:             if isActive[$c$] then
12:                 $g = g_n(c)$                                       ▷ Learn model on the first $n$ data points
13:                 $\forall i \in \{1, \ldots, N - n\} : P_p[c, i] \leftarrow \ell(g(x_{n+i}), y_{n+i})$    ▷ Evaluate on the rest
14:                 $P_S[c, s] \leftarrow \frac{1}{N-n} \sum_{i=1}^{N-n} P_p[c, i]$    ▷ Store mean performance
15:         index$_{top}$ $\leftarrow$ topConfigurations($P_p, \alpha$)    ▷ Find the top configurations
16:         $T_S[\text{index}_{top}, s] \leftarrow 1$                  ▷ And set entry in trace matrix
17:         for $c \in C$ do
18:             if isActive[$c$] and isFlopConfiguration(test, $T_S[c, 1 : s]$) then
19:                 isActive[$c$] $\leftarrow$ false                   ▷ De-activate flop configuration
20:         if similarPerformance($T_S[\text{isActive}, (s - w_{stop} + 1) : s], \alpha$) then
21:             break
22:         $n \leftarrow n + \Delta$
23:     return selectWinner($P_S$, isActive, $w_{stop}$, $s$)

More formally, the function topConfigurations takes the pointwise performance matrix
$P_p$ as input and rearranges the rows according to the mean performances of the
configurations, yielding a reordered matrix $P_p$. Now for $k \in \{2, 3, \ldots, K\}$ we check whether the first
$k$ configurations show a significantly different effect on the $N - n$ data points. This is
done by executing either the Friedman test or Cochran's Q test on the submatrix
$P_p[1 : k, 1 : (N - n)]$ with the pre-specified significance level $\alpha$. If the test does not indicate
a significant difference in the performance of the $k$ configurations, we increment $k$ by one
and test again until we find a significant effect. Suppose we find a significant effect at index
$k$. Since all previous tests indicated no significant effect for the first $k - 1$ configurations, we
argue that the addition of the $k$-th configuration must have triggered the test procedure to
indicate that the set of these $k$ configurations contains at least one configuration which shows
a significantly different behavior than all other configurations. Thus, we flag the configurations
$1, \ldots, k - 1$ as top configurations and the remaining $k, \ldots, K$ configurations as flop
configurations. Note that this incremental procedure is a multiple testing situation, thus
we apply the Bonferroni correction to the calculated p-values.
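
The incremental search for the group of top configurations can be sketched as follows for the classification case. This is a minimal illustration rather than the implementation in the CVST package: the names `chi2_sf`, `cochran_q_pvalue` and `top_configurations` are hypothetical, Cochran's Q is used with its usual chi-square approximation, and the Bonferroni factor is assumed to be the maximal number of tests, $K - 1$. For regression one would swap in the Friedman test.

```python
import math


def chi2_sf(x, df):
    """P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, nu = math.exp(-half), 2
    else:
        q, nu = math.erfc(math.sqrt(half)), 1
    # Recurrence: sf(x; nu + 2) = sf(x; nu) + (x/2)^(nu/2) e^(-x/2) / Gamma(nu/2 + 1)
    while nu < df:
        q += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q


def cochran_q_pvalue(rows):
    """Cochran's Q test; rows[j][i] is the 0/1 loss of configuration j on point i."""
    k, b = len(rows), len(rows[0])
    config_totals = [sum(r) for r in rows]
    point_totals = [sum(rows[j][i] for j in range(k)) for i in range(b)]
    total = sum(config_totals)
    denom = k * total - sum(t * t for t in point_totals)
    if denom == 0:  # all configurations agree on every point
        return 1.0
    q = (k - 1) * (k * sum(t * t for t in config_totals) - total ** 2) / denom
    return chi2_sf(q, k - 1)


def top_configurations(pointwise, alpha=0.05):
    """Return the indices of the rows flagged as top configurations."""
    K = len(pointwise)
    order = sorted(range(K), key=lambda c: sum(pointwise[c]))  # ascending mean loss
    n_tests = max(K - 1, 1)  # assumed Bonferroni factor
    for k in range(2, K + 1):
        p = cochran_q_pvalue([pointwise[c] for c in order[:k]])
        if min(1.0, p * n_tests) < alpha:
            return order[:k - 1]  # adding the k-th configuration broke similarity
    return order


if __name__ == "__main__":
    # Toy 0/1 losses on 40 held-out points for three configurations.
    losses = [[0] * 40, [0] * 38 + [1] * 2, [0] * 10 + [1] * 30]
    print(top_configurations(losses))
```

In the toy data the first two configurations differ on only two points and are kept together, while the third misclassifies most points and is flagged as a flop.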

For the actual calculation of the test errors we apply an incremental model building
process, i.e., the data added in each step on Line 22 increases the training data pool for
each step by a set of size $\Delta$. This would allow online algorithms to adapt their model also
incrementally leading to even further speed improvements. The results of this first step are
collected for each configuration in the trace matrix $T_S$ (see Figure 3, top right), which shows
the gradual transformation for the last 10 steps of the procedure highlighting the results
of the last test. So the robust transformation of the test error boils down the performance
of all models learned on the first $n$ data points to a new column in the trace matrix $T_S$
recording the history of each configuration in a top or flop scheme.

3.2 Determining Significant Losers 

Having transformed the test errors into a scale-independent top or flop scheme, we can now
test whether a given parameter configuration is an overall loser. Sequential testing of 
binary random variables is addressed in the sequential analysis framework developed by 
Wald (1947). Originally it has been applied in the context of production quality assessment 
(compare two production processes) or biological settings (stop bioassays as soon as the 
gathered data leads to a significant result). 

The main idea is the following: One observes a sequence of i.i.d. Bernoulli variables
$B_1, B_2, \ldots$, and wants to test whether these variables are distributed according to the
hypothesis $H_0 : B_i \sim \pi_0$ or the alternative hypothesis $H_1 : B_i \sim \pi_1$ with $\pi_0 < \pi_1$ denoting
the respective success probabilities of the Bernoulli variables. Both significance levels for
the acceptance of $H_1$ and $H_0$ can be controlled via the user-supplied meta-parameters $\alpha_l$
and $\beta_l$. The test computes the likelihood for the so far observed data and rejects one of
the hypotheses when the respective likelihood ratio leaves an interval controlled by the
meta-parameters. It can be shown that the procedure has a very intuitive geometric
representation, shown in Figure 3, lower left: The binary observations are recorded as cumulative
sums at each time step. If this sum exceeds the upper red line $L_1$, we accept $H_1$; if the sum
is below the lower red line $L_0$ we accept $H_0$; if the sum stays between the two red lines we
have to draw another sample.
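
In Wald's standard formulation for Bernoulli observations the two decision lines can be written explicitly. The following is a sketch under the usual approximate thresholds $(1-\beta_l)/\alpha_l$ and $\beta_l/(1-\alpha_l)$; the exact form used by CVST is deferred to Appendix C, so the mapping of $\alpha_l$ and $\beta_l$ onto Wald's error levels is an assumption here. The symbols $k_t$ (cumulative sum of the first $t$ observations), $\Lambda_t$ (likelihood ratio) and $D$ are introduced only for this illustration.

```latex
\begin{align*}
\log \Lambda_t &= k_t \log\frac{\pi_1}{\pi_0} + (t - k_t)\log\frac{1-\pi_1}{1-\pi_0},
  \qquad k_t = \sum_{i=1}^{t} B_i, \\
D &= \log\frac{\pi_1 (1-\pi_0)}{\pi_0 (1-\pi_1)}, \\
L_1(t) &= \frac{1}{D}\log\frac{1-\beta_l}{\alpha_l} + \frac{t}{D}\log\frac{1-\pi_0}{1-\pi_1}, \\
L_0(t) &= \frac{1}{D}\log\frac{\beta_l}{1-\alpha_l} + \frac{t}{D}\log\frac{1-\pi_0}{1-\pi_1}.
\end{align*}
```

Under these assumptions $H_1$ is accepted as soon as $k_t \ge L_1(t)$ and $H_0$ as soon as $k_t \le L_0(t)$. Both lines share the same slope, which is why the continuation region appears as a band between two parallel lines.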

Wald's test requires that we fix both success probabilities $\pi_0$ and $\pi_1$ beforehand. Since
our main goal is to use the sequential test to eliminate underperformers, we choose the
parameters $\pi_0$ and $\pi_1$ of the test such that the acceptance of $H_1$ (a configuration wins) is
postponed as long as possible. This will allow the CVST algorithm to keep configurations until the evidence
of their performances definitely shows that they are overall loser configurations. At the
same time, we want to maximize the area where configurations are eliminated (region $\Delta_{H_0}$
denoted by "LOSER" in Fig. 3), rejecting as many loser configurations on the way as
possible:

$$(\pi_0, \pi_1) = \operatorname*{argmax}_{\pi'_0, \pi'_1} \Delta_{H_0}(\pi'_0, \pi'_1, \beta_l, \alpha_l) \quad \text{s.t.} \quad S_a(\pi'_0, \pi'_1, \beta_l, \alpha_l) \in (S - 1, S] \qquad (3)$$

with $S_a(\cdot, \cdot, \cdot, \cdot)$ being the earliest step of acceptance of $H_1$ marked by an X in Fig. 3 and
the variable $S$ defined as the total number of steps. Using results from Wald (1947), the
global optimization in Equation (3) can be solved as follows:

$$\pi_0 = 0.5 \;\wedge\; \pi_1 = \min_{\pi'_1} \mathrm{ASN}(\pi_0, \pi'_1 \mid \pi = 1.0) \ge S \qquad (4)$$

where $\mathrm{ASN}(\cdot, \cdot)$ (Average Sample Number) is the expected number of steps until the given
test will yield a decision, if the underlying success probability of the tested sequence is
$\pi = 1.0$. Equipped with this test, we can check each remaining trace on Line 18 of
Algorithm 1 in the function isFlopConfiguration whether it is a statistically significant flop
configuration (i.e., falls below the lower decision boundary $L_0$) or not.
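
A rough sketch of how such a test could be set up and queried is given below. It is not taken from Appendix C: the closed form for $\pi_1$ is an assumption obtained by requiring that a trace consisting only of successes reaches Wald's approximate upper threshold $(1-\beta_l)/\alpha_l$ exactly at step $S$, and the function names `choose_pi1` and `is_flop_configuration` are hypothetical.

```python
import math


def choose_pi1(S, alpha_l, beta_l, pi0=0.5):
    """Pick pi1 so that an all-success trace accepts H1 at step S (assumed rule).

    From (pi1 / pi0)^S = (1 - beta_l) / alpha_l.
    """
    pi1 = pi0 * ((1.0 - beta_l) / alpha_l) ** (1.0 / S)
    if not pi0 < pi1 < 1.0:
        raise ValueError("S is too small for these significance levels")
    return pi1


def is_flop_configuration(trace, pi0, pi1, alpha_l, beta_l):
    """True if the 0/1 trace has crossed Wald's lower boundary (H0 accepted)."""
    successes = sum(trace)
    failures = len(trace) - successes
    log_ratio = (successes * math.log(pi1 / pi0)
                 + failures * math.log((1.0 - pi1) / (1.0 - pi0)))
    return log_ratio <= math.log(beta_l / (1.0 - alpha_l))


if __name__ == "__main__":
    pi1 = choose_pi1(S=10, alpha_l=0.01, beta_l=0.1)
    print(f"pi1 = {pi1:.4f}")
    for trace in ([0, 0], [0, 0, 0], [1, 0, 1, 1]):
        print(trace, is_flop_configuration(trace, 0.5, pi1, 0.01, 0.1))
```

With the suggested levels and ten steps, a configuration that is never among the top group is dropped after a few steps, whereas a single early failure is not enough to eliminate it.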
Note that sequential analysis formally requires i.i.d. variables, which might not be true
for configurations which transform to a winner configuration later on, thereby changing their
behavior from a flop to a top configuration. Therefore we tuned our procedure to use the
sequential analysis framework just for the decision whether a configuration is an overall loser
or not. The test is adjusted for this switch of roles by keeping potential configurations as long
as possible and dropping them only if their trace corresponds, with statistical significance, to a
binomial variable with $\pi < 0.5$. For details of the open sequential analysis please consult Wald (1947) or see
for instance Wetherill and Glazebrook (1986) for a general overview of sequential testing 
procedures. Appendix C contains the necessary details needed to implement the proposed 
testing scheme for the CVST algorithm. 

3.3 Early Stopping and Final Winner 

Finally, we employ an early stopping rule (Line 20) which takes the last $w_{stop}$ columns from
the trace matrix and checks whether all remaining configurations performed equally well
in the past. In Figure 3 this submatrix of the overall trace matrix $T_S$ is shown for a value
of $w_{stop} = 4$ for the remaining configurations after step 10. For the test, we again apply
Cochran's Q test (see Appendix G) in the similarPerformance procedure on the
submatrix of $T_S$. Figure 4 illustrates a complete run of the CVST algorithm for roughly
600 configurations. Each configuration marked in red corresponds to a flop configuration 
and a black one to a top configuration. Finally, configurations marked in gray have been 
dropped via the sequential test during the CVST algorithm. The small zoom-ins in the lower 
part of the picture show the last $w_{stop}$ remaining configurations during each step which are
used in the evaluation of the early stopping criterion. We can see that the procedure keeps 
on going if there is a heterogeneous behavior of the remaining configurations (zoom-in is 
mixed red/black). When all the remaining configurations performed equally well in the past 
(zoom-in is nearly black), the early stopping test does not see a significant effect anymore 
and the procedure is stopped. 

Finally, in the procedure selectWinner, Line 23, the winning configuration is picked
from the configurations which have survived all steps as follows: For each remaining
configuration we determine the rank in a step according to the average performance during
this step. Then we average the rank over the last $w_{stop}$ steps and pick the configuration
which has the lowest mean rank. This way, we make most use of the data accumulated
during the course of the procedure. By restricting our view to the last $w_{stop}$ observations
we also take into account that the optimal parameter might change with increasing model
size: since we focus on the most recent observations with the biggest models, we always
pick the configuration which is most suitable for the data size at hand.
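
The rank-based selection can be sketched in a few lines. The data layout (a dictionary mapping each configuration to its list of mean losses per step), the function name `select_winner` and the use of average ranks for ties are assumptions made for this illustration, not details taken from the CVST package.

```python
def select_winner(mean_loss, is_active, w_stop, s):
    """Pick the surviving configuration with the lowest mean rank.

    mean_loss[c][t] is the mean loss of configuration c in step t + 1;
    only the last w_stop of the first s steps are considered.
    """
    survivors = [c for c in mean_loss if is_active[c]]
    rank_sum = {c: 0.0 for c in survivors}
    for t in range(max(0, s - w_stop), s):
        losses = [mean_loss[c][t] for c in survivors]
        for c in survivors:
            x = mean_loss[c][t]
            lower = sum(1 for v in losses if v < x)
            ties = sum(1 for v in losses if v == x)  # includes c itself
            rank_sum[c] += lower + (ties + 1) / 2.0  # average rank among ties
    return min(survivors, key=lambda c: rank_sum[c])


if __name__ == "__main__":
    mean_loss = {
        "a": [0.50, 0.30, 0.20, 0.21],
        "b": [0.40, 0.31, 0.22, 0.20],
        "c": [0.10, 0.90, 0.90, 0.90],
    }
    is_active = {"a": True, "b": True, "c": False}
    print(select_winner(mean_loss, is_active, w_stop=3, s=4))
```

Ranking only among survivors keeps dropped configurations from influencing the result, and a larger window smooths over single-step fluctuations such as the last step in the toy data.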

3.4 Meta-Parameters for the CVST 

The CVST algorithm has a number of meta-parameters which the experimenter has to 
choose beforehand. In this section we give suggestions on how to choose these parameters. 
The parameter $\alpha$ controls the significance level for the test for similar behavior in each step
of the procedure. We suggest setting this to the usual level of $\alpha = 0.05$. Furthermore, $\beta_l$ and
$\alpha_l$ control the significance levels for accepting $H_0$ (configuration is a loser) and $H_1$
(configuration is a winner), respectively. We suggest an asymmetric setup by setting $\beta_l = 0.1$, since we want
to drop loser configurations relatively fast, and $\alpha_l = 0.01$, since we want to be really sure
when we accept a configuration as overall winner. Finally, we set $w_{stop}$ to 3 for $S = 10$ and
6 for $S = 20$, as we have observed that this choice works well in practice.

Figure 4: The upper plot shows a run of the CVST algorithm for roughly 600 configurations.
At each step a configuration is marked as top (black), flop (red) or dropped (gray).
The zoom-ins show the situation for steps 5 to 7 without the dropped entries. The
early stopping rule takes effect in step 7, because the remaining configurations
performed equally well during steps 5 to 7.



4. Theoretical Properties of the CVST Algorithm 

After having introduced the overall concept of the CVST algorithm, we now focus on the 
theoretical properties, which ensure the proper working of the procedure: Exploiting guar- 
antees of the underlying sequential testing framework, we show how the experimenter can 
control the procedure to work in a stable regime and furthermore prove error bounds for 
the CVST algorithm. Additionally, we show how the CVST algorithm can be used to work 
best on a given time budget. Finally, we discuss some unsolved questions and give possible 
directions for future research. 
4.1 Error Bounds in a Stable Regime

As discussed in Section 2 the performance of a configuration might change if we feed the 
learning algorithm more data. Therefore, a reasonable algorithm exploiting the learning on 
subsets of the data must be capable of dealing with these difficulties and potential change 
points in the behavior of certain configurations. In this section we investigate some
theoretical properties of the CVST algorithm which make it particularly suitable for learning
on increasing subsets of the data.

The first property of the open sequential test employed in the CVST algorithm comes 
in handy to control the overall convergence process and to assure that no configurations are 
dropped prematurely: 

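Lemma 2 (Safety Zone) Given the CVST algorithm with significance levels $\alpha_l, \beta_l$ for
being a top or flop configuration respectively, a maximal number of steps $S$, and a global
winning configuration which loses for the first $s_{cp}$ iterations, as long as

$$0 \le \frac{s_{cp}}{S} \le \frac{s_{\text{safe}}}{S} \quad\text{with}\quad s_{\text{safe}} = \frac{\log\frac{\beta_l}{1-\alpha_l}}{\log\left(2 - \sqrt[S]{\frac{1-\beta_l}{\alpha_l}}\right)} \quad\text{and}\quad S \ge \left\lceil \frac{\log\frac{1-\beta_l}{\alpha_l}}{\log 2} \right\rceil,$$

the probability that the configuration is dropped by the CVST algorithm is zero.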
Proof The details of the proof are deferred to Appendix C. 



The consequence of Lemma 2 is that the experimenter can directly control via the
significance levels $\alpha_l, \beta_l$ until which iteration no premature dropping should occur and therefore
guide the whole process into a stable regime in which the configurations will see enough 
data to show their real performance. 
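
As a numerical illustration of Lemma 2, the following sketch evaluates the safety zone
under the assumption that it has the form
$s_{\text{safe}} = \log\frac{\beta_l}{1-\alpha_l} / \log\left(2 - \sqrt[S]{(1-\beta_l)/\alpha_l}\right)$
with the minimal number of steps $\lceil \log\frac{1-\beta_l}{\alpha_l} / \log 2 \rceil$.
The function names are hypothetical.

```python
import math


def min_steps(alpha_l, beta_l):
    """Smallest number of steps S for which the safety zone is defined."""
    return math.ceil(math.log((1 - beta_l) / alpha_l) / math.log(2))


def safety_zone(alpha_l, beta_l, S):
    """Real-valued safety zone bound s_safe for a CVST run with S steps."""
    root = ((1 - beta_l) / alpha_l) ** (1.0 / S)
    if root >= 2.0:
        raise ValueError("S is too small for the given significance levels")
    return math.log(beta_l / (1 - alpha_l)) / math.log(2.0 - root)
```

For the suggested levels ($\alpha_l = 0.01$, $\beta_l = 0.1$) a 20 step run yields a bound
between 7 and 8, which agrees with the safety zone of 7 used in the 20 step example of
Figure 5.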

Equipped with this property we can now take a thorough look at the worst case perfor- 
mance of the CVST algorithm: Suppose a global winning configuration has been constantly 
marked as a loser up to the safety zone, because the amount of data available up to this 
point was not sufficient to show the superiority of this configuration. Given that the global 
winning configuration now sees enough data to be marked as a winning configuration by 
the binarization process throughout the next steps with probability $\pi$, we can give exact
error bounds of the overall process by solving specific recurrences. 

Figure 5 gives a visual impression of our worst case analysis for the example of a 20 step 
CVST execution: The winning configuration generated a straight line of zeros up to the 
safety zone of 7. Our approach to bound the error of the fast cross-validation now consists 
essentially in calculating the probability mass that ends up in the non-loser region. The 
following lemma shows how we can express the number of paths which lead to a specific 
point on the graph by a two-dimensional recurrence relation: 

Figure 5: Visualization of the worst-case scenario for the error probability of the CVST
algorithm: a global winner configuration is labeled as a constant loser until the
safety zone is reached. Then we can calculate the probability that this configuration
endures the sequential test by a recurrence scheme, which counts the number
of remaining paths ending up in the non-loser region.

Lemma 3 (Recurrence Relation) Denote by $\text{Path}(s_R, s_C)$ the number of paths which
lead to the point at the intersection of row $s_R$ and column $s_C$ and lie above the lower decision
boundary $L_0$ of the sequential test. Given the worst case scenario described above, the number
of paths can be calculated as follows:

$$\text{Path}(s_R, s_C) = \begin{cases}
1 & \text{if } s_R = 0 \wedge s_C \le s_{\text{safe}} = \left\lfloor \dfrac{\log\frac{\beta_l}{1-\alpha_l}}{\log\left(2 - \sqrt[S]{\frac{1-\beta_l}{\alpha_l}}\right)} \right\rfloor \\
1 & \text{if } s_R = s_C - s_{\text{safe}} \\
\text{Path}(s_R, s_C - 1) + \text{Path}(s_R - 1, s_C - 1) & \text{if } L_0(s_C) < s_R < s_C - s_{\text{safe}} \\
0 & \text{otherwise.}
\end{cases}$$



Proof We split the proof into the four cases: 

1. The first case is by definition: the configuration has a straight line of zeros up to the
safety zone $s_{\text{safe}}$.

2. The second case describes the diagonal path starting from the point $(1, s_{\text{safe}} + 1)$: by
construction of the paths (1 means diagonal up; 0 means one step to the right) the
diagonal path can just be reached by a single combination, namely a straight line of
ones.

3. The third case is the actual recurrence: if the given point is above the lower decision
bound $L_0$, then the number of paths leading to this point is equal to the number
of paths that lie directly to the left of this point plus the paths which lie directly
diagonal downwards from this point. From the first paths this point can be reached
by a direct step to the right and from the latter the current point can be reached by
a diagonal step upwards. Since there are no other options than that by construction,
this equality holds.

4. The last case describes all other paths, which either lie below the lower decision bound 
and therefore end up in the loser region or are above the diagonal and thus can never 
be reached. 



This recurrence is visualized in Figure 5. Each number on the grid gives the number of 
valid, non-loser paths, which can reach the specific point. With this recurrence we are now 
able to prove a global, worst-case error probability of the fast cross-validation. 

Theorem 4 (Error Bound of CVST) Suppose a global winning configuration has reached
the safety zone with a constant loser trace and then switches to a winner configuration with
a success probability of $\pi$. Then the probability that the CVST algorithm erroneously drops
this configuration can be bounded as follows:

$$P(\text{reject } \pi) \le 1 - \sum_{i=\lfloor L_0(S) \rfloor + 1}^{r} \text{Path}(i, S)\, \pi^{i} (1-\pi)^{r-i} \quad\text{with}\quad r = S - \left\lfloor \frac{\log\frac{\beta_l}{1-\alpha_l}}{\log\left(2 - \sqrt[S]{\frac{1-\beta_l}{\alpha_l}}\right)} \right\rfloor.$$

Proof The basic idea is to use the number of paths leading to the non-loser region to
calculate the probability that the configuration actually survives. This corresponds to the
last column of the example in Figure 5. Since we model the outcome of the binarization
process as a binomial variable with the success probability of $\pi$, the first diagonal path has
a probability of $\pi^{r}$. The next paths each have a probability of $\pi^{r-1}(1-\pi)^{1}$ and so on
until the last viable paths are reached in the point $(\lfloor L_0(S) \rfloor + 1, S)$. So the complete
probability of the survival of the configuration is summed up with the corresponding number of
paths from Lemma 3. Since we are interested in the complementary event, we subtract the
resulting sum from one, which concludes the proof. ■



Note that the early stopping rule does not interfere with this bound: The worst case 
is indeed that the process goes on for the maximal number of steps S, since then the 
probability mass will be maximally spread due to the linear lower decision boundary and 
the corresponding exponents are maximal. So if the early stopping rule terminates the 
process before reaching the maximum number of steps, the resulting error probability will 
be lower than our given bound. 
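
The recurrence of Lemma 3 and the bound of Theorem 4 can be evaluated with a few lines of
code. The sketch below rests on assumptions that are not spelled out in this section: the
sequential test is taken to be Wald's test with $\pi_0 = 1/2$ and
$\pi_1 = \frac{1}{2}\sqrt[S]{(1-\beta_l)/\alpha_l}$, which gives a linear lower decision
boundary $L_0$, and the safety zone is rounded down to an integer. All names are
hypothetical.

```python
import math
from functools import lru_cache


def cvst_error_bound(pi, S, alpha_l=0.01, beta_l=0.1):
    """Worst-case probability of dropping a winner with success probability pi."""
    # Assumed alternative hypothesis of the sequential test (pi_0 = 1/2).
    pi1 = 0.5 * ((1 - beta_l) / alpha_l) ** (1.0 / S)
    if pi1 >= 1.0:
        raise ValueError("S is too small for the given significance levels")
    log_a = math.log(beta_l / (1 - alpha_l))
    log_drift = math.log(2 * (1 - pi1))
    s_safe = math.floor(log_a / log_drift)

    def lower(n):
        # Lower decision line L_0(n) of Wald's test for a Bernoulli variable.
        return (log_a - n * log_drift) / math.log(pi1 / (1 - pi1))

    @lru_cache(maxsize=None)
    def path(s_r, s_c):
        if s_r == 0 and s_c <= s_safe:
            return 1  # straight line of zeros up to the safety zone
        if s_r == s_c - s_safe:
            return 1  # diagonal: straight line of ones after the safety zone
        if lower(s_c) < s_r < s_c - s_safe:
            return path(s_r, s_c - 1) + path(s_r - 1, s_c - 1)
        return 0  # loser region or unreachable point

    r = S - s_safe
    first = math.floor(lower(S)) + 1
    survive = sum(path(i, S) * pi**i * (1 - pi) ** (r - i) for i in range(first, r + 1))
    return 1.0 - survive
```

Under these assumptions a configuration that wins every step after the safety zone
($\pi = 1$) is never dropped, and the bound grows as $\pi$ decreases.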

The error bound for different success probabilities and the proposed sequential test with
$\alpha_l = 0.01$ and $\beta_l = 0.1$ is depicted in Figure 6. First of all we can observe a relatively fast
convergence of the overall error with increasing maximal number of steps $S$. The impact on
the error is marginal for the shown success probabilities, i.e. for instance for $\pi = 0.95$ the
error nearly converges to the optimum of 0.05. Note that the oscillations, especially for small
step sizes, originate from the rectangular grid imposed by the interplay of the Path-operator
and the lower decision boundary $L_0$, leading to some fluctuations. Overall, the chosen test
scheme allows us not only to control the safety zone but also has only a small impact on
the error probability, which once again shows the practicality of the open sequential ratio
test for the fast cross-validation procedure. By using this statistical test we can balance the
need for a conservative retention of configurations as long as possible with the statistically
controlled dropping of significant loser configurations with nearly no impact on the overall
error probability. Our analysis assumes that the experimenter has chosen the right safety
zone for the learning problem at hand. For small data sizes it could happen that this safety
zone was chosen too small, so that the change point of the global winning configuration
might lie outside the safety zone. While this will not occur often for today's sizes of data
sets, we have analyzed the behavior of CVST under these circumstances in Appendix D to
give a complete view of the properties of the algorithm.

Figure 6: Error bound of the fast cross-validation as proven in Theorem 4 for different
success probabilities $\pi$ and maximal step sizes $S$. To mark the global trend we
fitted a LOESS curve given as dotted line to the data.

4.2 Fast Cross-Validation on a Time Budget

While the CVST algorithm can be used out of the box to speed up regular cross-validation, 
the aforementioned properties of the procedure come in handy when we face a situation 
in which an optimal parameter configuration has to be found given a fixed computational 
budget. If the time is not sufficient to perform a full cross-validation or the amount of data 
that has to be processed is too big to explore a sufficiently spaced parameter grid with 
ordinary cross-validation in a reasonable time, the CVST algorithm allows for getting the 
most model selection information out of the data given the specified time constraint. 

Figure 7: Approximation of the time consumption for a cubic learner. In each step we
calculate a model on a subset of the data, so the model calculation time $t$ on the
full data set is adjusted accordingly. After $s_r \times S$ steps of the process, we assume
a drop to $r \times K$ remaining configurations.

This is achieved by calculating a maximal steps parameter $S$ which leads to a near
coverage of the available time budget $T$ as depicted in Figure 7. The idea is to specify
an expected drop ratio $r$ of configurations and a safety zone bound $s_{\text{safe}}$. Then we can
give a rough estimate of the total time needed for a CVST with a total number of steps
$S$, equate this with the available time budget $T$ and solve for $S$. More formally, given
$K$ parameter configurations and a pre-specified safety zone bound $s_{\text{safe}} = s_r \times S$ with
$0 < s_r < 1$ to ensure that no configuration is dropped prematurely, the computational
demands of the CVST algorithm are approximated by the sum of the time needed before
step $s_{\text{safe}}$ involving the model calculation of all $K$ configurations and after step $s_{\text{safe}}$ for
$r \times K$ configurations with $0 < r < 1$. As we will see in the experimental evaluation section,
this assumption of a given drop rate of $(1 - r)$ leading to the form of time consumption as
depicted in Figure 7 is quite common. The observed drop rate corresponds to the overall
difficulty of the problem at hand.
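
The time approximation described above can also be evaluated numerically. The sketch
below assumes a cubic learner whose cost on a fraction $s/S$ of the data is $t\,(s/S)^3$,
where $t$ is the model calculation time on the full data set (see the caption of Figure 7).
The function names and the search limit s_max are hypothetical, and the plain search merely
stands in for a closed-form solution.

```python
def cvst_time(S, K, t, r, s_r, exponent=3):
    """Approximate total time of a CVST run with S steps.

    All K configurations are trained up to the safety zone s_r * S;
    afterwards only r * K configurations are assumed to remain.
    """
    s_safe = int(s_r * S)
    total = 0.0
    for s in range(1, S + 1):
        n_conf = K if s <= s_safe else r * K
        total += n_conf * t * (s / S) ** exponent
    return total


def max_steps(T, K, t, r, s_r, s_max=200):
    """Largest S <= s_max whose approximate run time fits the budget T."""
    best = None
    for S in range(1, s_max + 1):
        if cvst_time(S, K, t, r, s_r) <= T:
            best = S
    return best
```

For example, with K = 100, t = 1, r = 0.2 and s_r = 0.5 a 10 step run costs roughly 78.5
time units in this model, compared to 1,000 units for a 10-fold cross-validation if each
fold is counted with the full cost t.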

Given the computation time $t$ needed to perform the model calculation on the full data
set, we prove in Appendix E that the optimal maximum step parameter for a cubic learner
can be calculated as follows:

$$S = \frac{2T - Kt(1-r)s_r^3 - rKt}{\left((1-r)s_r^4 + r\right)Kt} + \sqrt{\left(\frac{Kt(1-r)s_r^3 + rKt - 2T}{\left((1-r)s_r^4 + r\right)Kt}\right)^2 - \frac{(1-r)s_r^2 + r}{(1-r)s_r^4 + r}}$$

After calculating the maximal number of steps $S$ given the time budget $T$, we can use the
results of Lemma 2 to determine the maximal $\beta_l$ given a fixed $\alpha_l$, which yields the requested
safety zone bound $s_{\text{safe}}$.

4.3 Discussion of Further Theoretical Analyses

One precondition of the CVST algorithm as discussed in Section 2 is that
the optimal parameter configuration is independent of the number of data points. It is
a well-known fact that for instance for k-nearest neighbor algorithms the k scales with
the number of data points, thus rendering the CVST algorithm in its
current incarnation unsuitable. If this law of scaling is known, one could find a mapping
of configurations from previous steps to configurations of the current step, i.e. the CVST
algorithm would need an additional layer of indirection with regard to the configurations
when accessing information from previous runs out of the trace matrix $T_S$ and the overall
performance matrix $P_S$.

An additional concern to the practitioner is how to choose the correct size of the safety
zone $s_{\text{safe}}$. If the training set does not contain enough data to get to a stable regime of
the parameter configurations, even regular cross-validation on the full data set would yield 
incorrect configurations. But if we have just barely enough data to reach this stable region, 
setting the right safety zone is essential for the CVST algorithm to return the correct 
configurations. Unfortunately we are not aware of any test or bound which could hint at 
the right safety zone given a data set and learner. Yet, in today's world of big data where 
sample sizes are more often too big than too small, this might not pose a serious problem 
anymore. Nevertheless, we have analyzed the behavior of the CVST algorithm in case the 
experimenter underestimates the safety zone in Appendix D showing that even for these 
cases CVST is able to absorb a certain amount of misspecification. 

The similarity test introduced in Section 3.1 relies on two assumptions: First, the aver- 
aged loss function over the data not used for training in one step gives us a good indicator 
of the performance of a configuration. Second, well performing configurations show simi- 
lar behavior in classification or regression on the data not used for learning. While these 
assumptions definitely make sense, they encode a certain optimism of how the grid of con- 
figurations is populated: If we have too few configurations as input to the procedure it 
might happen that some non-optimal configurations mask out the other, normally optimal, 
configurations just by chance. To overcome this problem we therefore would need a certain 
amount of redundancy in the configuration grid. Both the amount of redundancy and thus 
the similarity measure underlying this redundancy assumption are hard to grasp theoreti- 
cally, yet, it could lead to new ways to model the binary transformation of the performance 
of configurations in each step of the CVST algorithm. 

There might be even further potential in the behavior of similar configurations that 
could be used in the CVST algorithm: If there is a notion of similarity between different 
configurations, it would be interesting to exploit this information and incorporate it into 
the CVST algorithm. For instance, one could add this kind of information in the function 
topConfigurations of Algorithm 1 to average the result of similar configurations and, 
hence, extend the pooling effect of the test already available for the data point dimension 
in the direction of configurations. 

While the selection scheme explained in Section 3.2 deals with the fact of potential 
change points of a configuration, it is not clear how independent the individual entries of a 
trace for a given configuration are and how much these potential dependencies influence the 
power of the sequential testing framework. Preliminary experiments comparing the CVST 
algorithm as described in this paper and a version of the CVST algorithm where at each 
step the data pool is shuffled, thus, yielding always different data points for learning and 
evaluation, showed no significant differences between these two versions. This suggests that at
least the potential dependencies introduced by the subsequent addition of data points do
not interfere with the independence assumption of the sequential testing framework. We will
see in the evaluation section that the CVST procedure in its current form shows excellent 
behavior throughout a wide range of data sets; yet, further research of the theoretical 
properties of CVST might yield even better procedures in the future. 
5. Experiments

Before we evaluate the CVST algorithm on real data, we investigate its performance on
controlled data sets. Both for regression and classification tasks we introduce specially
tailored data sets to highlight the overall behavior and to stress-test the fast cross-validation
procedure. To evaluate how the choice of learning method influences the performance of the
CVST algorithm, we compare kernel logistic regression (KLR) against a $\nu$-Support Vector
Machine (SVM) for classification problems and kernel ridge regression (KRR) versus
$\nu$-SVR for regression problems, each using a Gaussian kernel (see Roth, 2001; Schölkopf et al.,
2000). In all experiments we use a 10 step CVST with parameter settings as described in
Section 3.4 (i.e. $\alpha = 0.05$, $\alpha_l = 0.01$, $\beta_l = 0.1$, $w_{\text{stop}} = 3$) to give us an upper bound of the
expected speed gain. Note that we could get even higher speed gains by either lowering the
number of steps or increasing $\beta_l$. From a practical point of view we believe that the settings
studied are highly realistic.

5.1 Artificial Data Sets 

To assess the quality of the CVST algorithm we first examine its behavior in a controlled 
setting. We have seen in our motivation section that a specific learning problem might 
have several layers of structure which can only be revealed by the learner if enough data is 
available. For instance in Figure 2(a) we can see that the first optimal plateau occurs at
$\sigma = 0.1$, while the real optimal parameter centers around $\sigma = 0.01$. Thus, the real optimal
choice only becomes apparent if we have seen more than 200 data points.

In this section we construct a learning problem both for regression and classification 
tasks which could pose severe problems for the CVST algorithm: If it stops too early, it will 
return a suboptimal parameter set. We evaluate how different intrinsic dimensionalities of 
the data and various noise levels affect the performance of the procedure. For classification 
tasks we use the noisy sine data set, which consists of a sine uniformly sampled from a
range controlled by the intrinsic dimensionality $d$:

$$y = \sin(x) + \epsilon \quad\text{with}\quad \epsilon \sim \mathcal{N}(0, n^2),\ x \in [0, 2\pi d],\ n \in \{0.25, 0.5\},\ d \in \{5, 50, 100\}$$

The labels of the sampled points are just the sign of $y$. For regression tasks we devise the
noisy sinc data set, which consists of a sinc function overlaid with a high-frequency sine:

$$y = \operatorname{sinc}(4x) + \frac{\sin(15dx)}{5} + \epsilon \quad\text{with}\quad \epsilon \sim \mathcal{N}(0, n^2),\ x \in [-\pi, \pi],\ n \in \{0.1, 0.2\},\ d \in \{2, 3, 4\}$$
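
A minimal sketch of how such data sets could be generated is given below. Several details
are assumptions made for illustration: the unnormalized definition
$\operatorname{sinc}(x) = \sin(x)/x$, the high-frequency term $\sin(15dx)/5$ as we read it
from the formula above, the use of $n$ as the standard deviation of the noise, and all
function names.

```python
import math
import random


def sinc(x):
    """Unnormalized sinc function (assumed definition)."""
    return 1.0 if x == 0 else math.sin(x) / x


def noisy_sine(n_points, d, noise, rng):
    """Classification data: x uniform in [0, 2*pi*d], label = sign of noisy sine."""
    data = []
    for _ in range(n_points):
        x = rng.uniform(0.0, 2.0 * math.pi * d)
        y = math.sin(x) + rng.gauss(0.0, noise)
        data.append((x, 1 if y >= 0 else -1))
    return data


def noisy_sinc(n_points, d, noise, rng):
    """Regression data: sinc overlaid with a high-frequency sine plus noise."""
    data = []
    for _ in range(n_points):
        x = rng.uniform(-math.pi, math.pi)
        y = sinc(4 * x) + math.sin(15 * d * x) / 5 + rng.gauss(0.0, noise)
        data.append((x, y))
    return data
```

Passing a seeded generator such as random.Random(0) keeps repeated runs comparable.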

For each of these data sets we generate 1,000 data points and run a 10 step CVST and
compare its results with a normal 10-fold cross-validation on the full data set. We record
both the test error on additional 10,000 data points and the time consumed for the parameter
search. The explored parameter grid contains 610 equally spaced parameter configurations
for each method ($\log_{10}(\sigma) \in \{-3, -2.9, \ldots, 3\}$ and $\nu \in \{0.05, 0.1, \ldots, 0.5\}$ for SVM/SVR
and $\log_{10}(\lambda) \in \{-7, -6, \ldots, 2\}$ for KLR/KRR, respectively). This process is repeated 50
times to gather sufficient data for an interpretation of the overall process.
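
The size of the grid can be checked with a short sketch. It assumes that the kernel width
grid has 61 values and that both the $\nu$ grid and the $\lambda$ grid have 10 values, which is the
reading consistent with the stated total of 610 configurations; the function names are
hypothetical.

```python
def svm_grid():
    """Kernel widths sigma crossed with nu values (assumed 61 x 10 grid)."""
    sigmas = [10 ** (-3 + 0.1 * i) for i in range(61)]  # log10(sigma) = -3, ..., 3
    nus = [0.05 * i for i in range(1, 11)]               # 0.05, 0.10, ..., 0.50
    return [(sigma, nu) for sigma in sigmas for nu in nus]


def kernel_grid():
    """Kernel widths sigma crossed with regularization values lambda."""
    sigmas = [10 ** (-3 + 0.1 * i) for i in range(61)]
    lambdas = [10.0 ** e for e in range(-7, 3)]          # log10(lambda) = -7, ..., 2
    return [(sigma, lam) for sigma in sigmas for lam in lambdas]
```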

The results for the noisy sine data set can be seen in Figure 8. The left boxplots show
the distribution of the difference in mean square error of the best parameter determined
by CVST and normal cross-validation. In the low noise setting ($n = 0.25$) the CVST
algorithm finds the same optimal parameter as the normal cross-validation up to the intrinsic
dimensionality of $d = 50$. For $d = 100$ the CVST algorithm gets stuck in a suboptimal
parameter configuration yielding an increased classification error compared to the normal
cross-validation. This tendency is slightly increased in the high noise setting ($n = 0.5$),
yielding a broader distribution. The classification method used seems to have no direct
influence on the difference; both SVM and KLR show nearly identical behavior. This picture
changes when we look at the speed gains: While the speed gain of the SVM nearly always
ranges between 10 and 20, the KLR shows a speed-up between 20 and 70 times. The variance
of the speed gain is generally higher compared to the SVM, which seems to be a direct
consequence of the inner workings of KLR: The main loop performs at each step a matrix
inversion of the whole kernel matrix until the calculated coefficients converge. This
convergence criterion leads to a relatively widespread distribution of the speed gain when
compared to the SVM performance.

Figure 9 shows the distribution of the number of remaining configurations after each 
step of the CVST algorithm. In the low noise setting (upper row) we can observe a tendency 
of bigger dropping rates up to d = 100. For the high noise setting (lower row) we observe 
a steady increase of kept configurations combined with a higher spread of the distribution. 
Overall we see a very effective dropping rate of configurations for all settings. The SVM 
and the KLR show nearly similar behavior so that the higher speed gain of the KLR we 
have seen before is a direct consequence of the algorithm itself and is not influenced by the 
CVST algorithm. 

The performance on the noisy sinc data set is shown in Figure 10. The first striking
observation is the transition of the CVST algorithm which can be observed for the intrinsic
dimensionality of $d = 3$. At this point the overall excellent performance of the CVST
algorithm is on the verge of choosing a suboptimal parameter configuration. This behavior
is more evident in the high noise setting. In the case of SVR the difference to the solution
found by the normal cross-validation is always smaller than for KRR. The speed gain
observed shows a small decline over the different dimensionalities and noise levels and
ranges between 10 and 20 for the SVR and 50 to 100 for KRR.

Figure 10: Difference in mean square error (left plots) and relative speed gain (right plots)
for the noisy sinc data set.

Figure 11: Remaining configurations after each step for the noisy sinc data set.

This is a direct consequence of the behavior which can be observed in the number of 
remaining configurations shown in Figure 11. Compared to the classification experiments 
the drop is much more drastic. The intrinsic dimensionality and the noise level show a small 
influence (higher dimensionality or noise level yields more remaining configurations) but the 
overall variance of the distribution is much smaller than in the classification experiments. 

In Figure 12 we examine the influence of more data on the performance of the CVST
algorithm. Both for the noisy sine and the noisy sinc data set we are able to estimate the correct
parameter configuration for all noise and dimensionality settings if we feed the CVST with
enough data.¹ Clearly, the CVST is capable of extracting the right parameter configuration
if we increase the amount of data to 2000 or 5000 data points, rendering our method even
more suitable for big data scenarios: If data is abundant, CVST will be able to estimate
the correct parameter in a much smaller time frame.

5.2 Benchmark Data Sets 

After demonstrating the overall performance of the CVST algorithm on controlled data
sets, we will investigate its performance on real-life and well-known benchmark data sets.
For classification we picked a representative choice of data sets from the IDA benchmark
repository (see Rätsch et al. 2001).² Furthermore we added the first two classes with the
most entries of the covertype data set (see Blackard and Dean, 1999). Then we follow the
procedure of the paper in sampling 2,000 data points of each class for the model learning
and estimate the test error on the remaining data points. For regression we pick the data
used in Donoho and Johnstone (1994) and add the bank32mn, pumadyn32mn and kin32mn
of the Delve repository.³

1. Note that we have to limit this experiment to the SVM/SVR method, since the full cross-validation of
the KLR/KRR would have taken too much time to compute.

2. Available at http://www.mldata.org.

Figure 12: Difference in mean square error for SVM/SVR with increasing data set size for
the noisy sine (left) and the noisy sinc (right) data sets. By adding more data, the
CVST algorithm converges to the correct parameter configuration.

We process each data set as follows: First we normalize each variable of the data to 
zero mean and variance of one, and in case of regression we also normalize the dependent 
variable. Then we split the data set in half and use one part for training and the other for the 
estimation of the test error. This process is repeated 50 times to get sufficient statistics for 
the performance of the methods. As in the artificial data setting we compare the difference 
in test error and the speed gain of the fast compared to the normal cross-validation on the 
same parameter grid of 610 values.⁴

Figure 13 shows the result for the classification data sets (left side) and the regression
data sets (right side). The upper panels depict the difference in mean square error (MSE).
For the classification tasks this difference never exceeds two percentage points, showing that
although the fast cross-validation procedure in some cases seems to pick a suboptimal
parameter set, the impact of this false decision is small. The same holds true for the
regression tasks: since the dependent variables for all problems have been normalized to zero
mean and variance of one, the differences in MSE values are comparable. We observe that,
as for the classification tasks, we see just a very small difference in MSE. Although for some
problems the CVST algorithm picks a suboptimal parameter set, even then the differences in
error are always relatively small. The learners have hardly any impact on the behavior; just
for the covertype and the blocks, bumps and doppler data sets we see a significant difference
between the corresponding methods. Especially for the bumps data set in combination with KRR
we see a high deviation from the normal cross-validation result, indicating a suboptimal
parameter choice of the CVST algorithm. Interestingly, the performance of the SVR on
this data set shows nearly no deviation from the result of the normal cross-validation. In
terms of speed gain we see a much more diverse and varying picture. Overall, the speed
improvements for KLR and KRR are higher than for SVM and SVR and reach up to 120
times compared to normal cross-validation. Regression tasks in general seem to be solved
faster than classification tasks, which can clearly be explained when we look at the traces
in Figure 14: For classification tasks the number of kept configurations is generally much
higher than for the regression tasks. Furthermore we can observe several types of difficulty
of the learning problems. For instance the german data set seems to be much more difficult
than the ringnorm data (see Braun et al., 2008), which is also reflected in the difference and
speed improvement seen in the previous figure.

3. Available at http://www.cs.toronto.edu/~delve.

4. For the blocks, bumps, and doppler data sets of Donoho and Johnstone (1994) we had to adjust the range
of $\sigma$ to $\log_{10}(\sigma) \in \{-6, -5.9, \ldots, 0\}$ to adapt to the small structure found in these data sets.

Figure 14: Remaining configurations after each step for different benchmark data sets.

In summary, the evaluation of the benchmark data sets shows that the CVST algorithm 
gives a huge speed improvement compared to the normal cross-validation. While we see some 
non-optimal choices of configurations, the total impact on the error is never exceptionally 
high. We have to keep in mind that we have chosen the parameters of our CVST algorithm 
to give an impression of the maximal attainable speed-up: more conservative settings would 
trade computational time for lowering the impact on the test error. 

6. Discussion and Related Work 

In this section we will deal with several aspects of the CVST algorithm. Since sequential
testing has been used extensively in the machine learning context, Section 6.1 summarizes
this work and discusses its relationship to the CVST algorithm. Furthermore, we illuminate
the inner structure of the overall procedure and discuss potential extensions and properties
of specific steps.

The CVST algorithm consists of a sequence of tightly coupled modules: The output 
of the top or flop test is the input for the subsequent test for significant losers. The 
performance history of all remaining configurations is then the input for the early stopping 
rule which looks for similar performance of the remaining configurations on the learning 
problem to capture the right point in time to stop the CVST loop. This stepwise procedure
is depicted in Figure 15: While the tests for top or flop configurations (step ①) and the
following sequential analysis (step ②) focus solely on the individual configurations, the
early stopping rule (step ③) acts on a global scope by determining the right point to stop
the CVST algorithm. Thus, we face two kinds of tests, namely the configuration-specific
and the problem-specific tests.

To complete our discussion of the CVST algorithm, we focus on the configuration-specific
procedures. First, we analyze the inner structure of the similarity test based on the
error landscape in Section 6.2 and how this module can be adjusted for specific side
constraints. Furthermore, in Section 6.3 we look at the suitability of the sequential analysis for
determining significant loser configurations. It is shown that the so-called closed sequential
test lacks essential properties of the open variant of Wald used in the CVST algorithm,
which further underlines the appropriateness of the open test of Wald for the learning on
increasing subsets of data.

Figure 15: Conceptual view of the CVST algorithm. Each execution of the loop body
consists of a sequence of tests, each delivering the input for the following test. This
modular structure allows for customization of the CVST algorithm to special
situations (multi-class experiments, structured learning etc.).

6.1 Sequential Testing in Machine Learning 

Using statistical tests and the sequential analysis framework in order to speed up learning 
has been the topic of several lines of research. However, the existing body of work mostly 
focuses on reducing the number of test evaluations, while we focus on the overall process of 
eliminating candidates themselves. To the best of our knowledge, this is a new concept and 
can apparently be combined with the already available racing techniques to further reduce 
the total calculation time. 

Maron and Moore (1994, 1997) introduce the so-called Hoeffding Races which are based 
on the non-parametric Hoeffding bound for the mean of the test error. At each step of the 
algorithm a new test point is evaluated by all remaining models and the confidence intervals 
of the test errors are updated accordingly. Models whose confidence interval of the test error 
lies outside of at least one interval of a better performing model are dropped. Chien et al. 
(1995, 1999) devise a similar range of algorithms using concepts of PAC learning and game 
theory: different hypotheses are ordered by their expected utility according to the test data 
the algorithm has seen so far. As for Hoeffding Races, the emphasis in this approach lies 
on reducing the number of evaluations. 
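
To make the elimination rule of such races concrete, the following is a minimal sketch of a single Hoeffding-style elimination round. It is an illustration only and not the implementation of Maron and Moore: the function names `hoeffding_radius` and `race_step`, the default confidence parameter `delta`, and the assumption that losses lie in `[0, loss_range]` are ours.

```python
import math

def hoeffding_radius(n, delta, loss_range=1.0):
    """Half-width of a two-sided Hoeffding interval for the mean of n
    losses bounded in [0, loss_range], valid with probability >= 1 - delta."""
    return loss_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def race_step(losses, delta=0.05, loss_range=1.0):
    """One elimination round.

    losses maps a model name to the list of per-point test losses seen so far.
    A model is dropped if the lower end of its interval lies above the
    smallest upper end among all models."""
    bounds = {}
    for name, values in losses.items():
        mean = sum(values) / len(values)
        eps = hoeffding_radius(len(values), delta, loss_range)
        bounds[name] = (mean - eps, mean + eps)
    best_upper = min(upper for _, upper in bounds.values())
    return [name for name, (lower, _) in bounds.items() if lower <= best_upper]

# Hypothetical 0/1 losses of three models on the same 100 test points.
survivors = race_step({
    "a": [0] * 90 + [1] * 10,   # mean loss 0.1
    "b": [1] * 90 + [0] * 10,   # mean loss 0.9
    "c": [0] * 80 + [1] * 20,   # mean loss 0.2
})
```

In this made-up example the interval of model `b` lies entirely above that of model `a`, so only `b` is removed; `c` overlaps with `a` and stays in the race.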

This concept of racing is further extended by Domingos and Hulten (2001): By intro- 
ducing an upper bound for the learner's loss as a function of the examples, the procedure 
allows for an early stopping of the learning process, if the loss is nearly as optimal as for 
infinite data. Birattari et al. (2002) apply racing in the domain of evolutionary algorithms 
and extend the framework by using the Friedman test to filter out non-promising configu- 
rations. While Bradley and Schapire (2008) use similar concepts in the context of boosting 
(FilterBoost), Mnih et al. (2008) introduce the empirical Bernstein Bounds to extend both
the FilterBoost framework and the racing algorithms. In both cases the bounds are used
to estimate the error within a specific ε-region with a given probability. Pelossof and Jones
(2009) use the concept of sequential testing to speed up the boosting process by control- 
ling the number of features which are evaluated for each sample. In a similar fashion this 
approach is used in Pelossof and Ying (2010) to increase the speed of the evaluation of 
the perceptron and in Pelossof and Ying (2011) to speed up the Pegasos algorithm. Stan- 
ski (2012) uses a partial leave-one-out evaluation of model performance to get an estimate 
of the overall model performance, which is used to pick the most probable best model. 
These racing concepts are applied in a wide variety of domains like reinforcement learning 
(Heidrich-Meisner and Igel, 2009) and timetabling (Birattari, 2009) showing the relevance 
and practical impact of the topic. 

Recently, Bayesian optimization has been applied to the problem of hyper-parameter 
optimization of machine learning algorithms. Bergstra et al. (2011) use the sequential 
model-based global optimization framework (SMBO) and implement the loss function of 
an algorithm via hierarchical Gaussian processes. Given the previously observed history of 
performances, a candidate configuration is selected which minimizes this historical surro- 
gate loss function. Applied to the problem of training deep belief networks this approach 
shows superior performance over random search strategies. Snoek et al. (2012) extend this
approach by including timing information for each potential model, i.e., the cost of learning
a model: optimizing the expected improvement per second leads to a global optimization
in terms of wall-clock time. Thornton et al. (2012) apply the SMBO framework in the
context of the WEKA machine learning toolbox: the so-called Auto-WEKA procedure does 
not only find the optimal parameter for a specific learning problem but also searches for 
the most suitable learning algorithm. Like the racing concepts, these Bayesian optimization 
approaches are orthogonal to the CVST approach and could be combined to speed up each 
step of the CVST loop. 

At first sight, the multi-armed bandit problem (Berry and Fristedt, 1985; Cesa-Bianchi
and Lugosi, 2006) also seems to be related to the problem considered here: In the
multi-armed bandit problem, a number of distributions are given and the task is to identify the
distribution with the largest mean from a chosen sequence of samples from the individual 
distributions. In each round, the agent chooses one distribution to sample from and typically 
has to find some balance between exploring the different distributions, rejecting distributions 
which do not seem promising and focusing on a few candidates to get more accurate samples. 

This looks similar to our setting where we also wish to identify promising candidates 
and reject underperforming configurations early on in the process, but the main difference 
is that the multi-armed bandit setting assumes that the distributions are fixed whereas we 
specifically have to deal with distributions which change as the sample size increases. This 
leads to the introduction of a safety zone, among other things. Therefore, the multi-armed 
bandit setting is not applicable across different sample sizes. On the other hand, the multi- 
armed bandit approach is a possible extension to speed up the computation within a fixed 
training size similar to the Hoeffding races already mentioned above. 

6.2 Checking the Similarity of the Error Landscape 

In the evaluation of the CVST method in Section 5 we see that the Friedman test for the
regression case shows a much more aggressive behavior than the Cochran's Q test used in
the top or flop conversion in the classification case. This feature can be clearly seen in
Figure 14, where the dropping rates of the classification and regression benchmark data sets
can be easily compared. Since the Friedman test acts on the squared residuals, it uses more
information than is available in the classification task, where we only know whether
a specific data point was correctly classified or not. Thus, the Friedman test can exploit
the higher detail of the information and can decide much faster than the Cochran's Q test
which of the configurations are significantly different from the top performing ones.

In this section we show how the modular design of the CVST algorithm can be utilized 
to fit a less aggressive, yet more robust similarity test for regression data into the overall 
framework. It comes as no surprise that this increased tolerance affects the runtime of the
CVST procedure. In the following we will first develop the alternative similarity test and 
then compare its performance both on the toy and the benchmark data sets to the original 
Friedman variant. 

Recall from Section 3.1 that the top or flop assignment was calculated in a sequential 
manner: First we order all remaining configurations according to their mean performance; 
then we check at which point the addition of another configuration shows a significantly 
different behavior compared to all other, better performing configurations. To employ a 
less strict version of the Friedman test we drop the actual residual information and instead
use the outlier behavior of a configuration for comparison. To this end we assume that the
residuals are normally distributed with mean zero and a configuration-dependent variance
$\sigma_c^2$, which we estimate from the actual residuals. Using the normality assumption, we
can then check for each calculated residual whether it falls outside the corresponding
confidence interval around zero, thus converting the raw residuals into binary information:
each residual is either deemed an outlier or not. Similar to the classification case, this
binary matrix forms the input to the Cochran's Q test, which then determines whether a
specific configuration belongs to the top-performing ones or not.
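
The conversion described above can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the interval half-width `z = 1.96` (a two-sided 95% normal interval) is our assumption, and the variance is estimated as the mean squared residual because the mean is assumed to be zero. The statistic is the standard Cochran's Q for a binary matrix with one row per configuration and one column per data point.

```python
import math

def outlier_indicators(residuals, z=1.96):
    """Flag each residual of one configuration as outlier (1) or not (0).

    Assumes zero-mean normal residuals; sigma is estimated from the data,
    z = 1.96 is an assumed interval half-width in units of sigma."""
    sigma = math.sqrt(sum(e * e for e in residuals) / len(residuals))
    return [1 if abs(e) > z * sigma else 0 for e in residuals]

def cochran_q(matrix):
    """Cochran's Q statistic for a binary matrix (rows: configurations,
    columns: data points). Returns 0.0 if no column discriminates."""
    k = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    grand_total = sum(row_totals)
    denominator = sum(c * (k - c) for c in col_totals)
    if denominator == 0:
        return 0.0
    numerator = k * (k - 1) * sum((r - grand_total / k) ** 2 for r in row_totals)
    return numerator / denominator

# Hypothetical residuals of one configuration on five data points.
flags = outlier_indicators([0.1, -0.1, 0.1, -0.1, 5.0])

# Hypothetical binary matrix for three configurations on four data points.
q = cochran_q([[1, 1, 0, 1],
               [1, 0, 0, 1],
               [0, 0, 0, 1]])
```

The resulting value of `q` would then be compared against the appropriate $\chi^2$ quantile with $K - 1$ degrees of freedom, exactly as in the classification case.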

The results of this procedure on the noisy sine data set are shown in Figure 16: Compared
to the outcome of the Friedman test in Figure 10 we can clearly see that the conservative 
nature of the outlier-based test helps in finding the correct parameter configuration. Obvi- 
ously its higher retention rate leads to lower runtime performance: The speed ratio drops 
roughly by a factor of |. A similar behavior can be observed on the benchmark data sets 
in Figure 17: The conservative behavior of the outlier-based measure keeps the difference 
to the full cross-validation lower compared to the residual-based test, also resulting in a 
lower speed ratio. Interestingly, for the benchmark data sets the speed impact on the SVR 
is much lower compared to the speed ratio decrease of the KRR method. We can observe 
this shift also in the number of kept configurations shown in Figure 18 both for the noisy 
sine and the benchmark data sets. 

The conclusion of this discussion is twofold: First, this section shows how the modular
construction of the CVST method allows for the exchange of the individual parts of the
algorithm without disrupting the workflow of the procedure. If the residual-based test turns
out to be unsuitable for a given regression problem, it is easy to devise an adapted
version, for instance by looking at the outlier behavior of the configurations. Second, we see
the inherent flexibility of the CVST algorithm: If there is a need for different error measures
(for instance in multi-class experiments, structured learning etc.), the modularized structure
of the CVST algorithm allows for maximal flexibility and adaptability to special cases.

Figure 16: Difference in mean square error (left plots) and relative speed gain (right plots)
for the noisy sine data set using the outlier-based similarity test. In comparison
to the stricter Friedman test used in Figure 10 we can observe a more conservative
behavior resulting in increased robustness at the expense of performance.

Figure 17: Difference in mean square error (left plot) and relative speed gain (right plot)
for the benchmark data sets using the outlier-based similarity test. Compared
to Figure 13 we can see better accuracy behavior but decreased performance.

Figure 18: Remaining configurations after each step for the noisy sine and different benchmark
data sets using the outlier-based similarity test. Compared to Figure 11
and Figure 14 we can clearly observe the higher retention rate of this more
conservative test.

6.3 Determining Significant Losers: Open versus Closed Sequential Testing

As already introduced in Section 3.2, sequential testing was pioneered by Wald (1947);
the test monitors a likelihood ratio of a sequence of i.i.d. Bernoulli variables $B_1, B_2, \ldots$:

$$\ell = \prod_{i=1}^{n} f(b_i, \pi_1) \Big/ \prod_{i=1}^{n} f(b_i, \pi_0) \quad \text{given } H_h : B_i \sim \pi_h,\ h \in \{0, 1\}.$$

Hypothesis $H_1$ is accepted if $\ell > A$ and, conversely, $H_0$ is accepted if $\ell < B$. If neither of
these conditions applies, the procedure cannot accept either of the two hypotheses and needs
more data. $A$ and $B$ are chosen such that the error probabilities of the two decisions do
not exceed $\alpha_l$ and $\beta_l$, respectively. In Wald and Wolfowitz (1948) it is proven that the open
sequential probability ratio test of Wald is optimal in the sense that compared to all tests
with the same power it requires on average the fewest observations for a decision. The testing
scheme of Wald is called open since the procedure could potentially go on forever, as long
as $\ell$ does not leave the $(A, B)$-tunnel.
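
As a minimal sketch of the open test for Bernoulli observations, the following function accumulates the log-likelihood ratio and stops as soon as it leaves the tunnel. The thresholds use Wald's standard approximations $A = (1-\beta_l)/\alpha_l$ and $B = \beta_l/(1-\alpha_l)$; the function name and the example parameters ($\pi_0 = 0.5$, $\pi_1 = 0.8$) are illustrative assumptions, not values prescribed by the text.

```python
import math

def sprt_bernoulli(observations, pi0, pi1, alpha, beta):
    """Open sequential probability ratio test of H0: pi = pi0 against
    H1: pi = pi1 with pi0 < pi1 < 1.

    Returns (decision, n), where decision is "H1", "H0" or "undecided"
    and n is the number of observations consumed."""
    log_a = math.log((1.0 - beta) / alpha)   # upper boundary log A
    log_b = math.log(beta / (1.0 - alpha))   # lower boundary log B
    step_up = math.log(pi1 / pi0)                     # contribution of a 1
    step_down = math.log((1.0 - pi1) / (1.0 - pi0))   # contribution of a 0
    log_ratio = 0.0
    n = 0
    for n, b in enumerate(observations, start=1):
        log_ratio += step_up if b else step_down
        if log_ratio > log_a:
            return "H1", n
        if log_ratio < log_b:
            return "H0", n
    # The open test never forces a decision; it simply needs more data.
    return "undecided", n

winner = sprt_bernoulli([1] * 20, pi0=0.5, pi1=0.8, alpha=0.01, beta=0.1)
loser = sprt_bernoulli([0] * 20, pi0=0.5, pi1=0.8, alpha=0.01, beta=0.1)
```

With these illustrative parameters a trace of constant ones crosses the upper boundary after 10 observations, while a trace of constant zeros crosses the lower boundary after 3.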

The open design of Wald's procedure led to the development of a different kind of sequential
tests, where the number of observations is fixed beforehand (see Armitage, 1960; Spicer,
1962; Alling, 1966; McPherson and Armitage, 1971). For instance, in clinical studies it might
be impossible or ethically prohibitive to use a test which potentially could go on forever.
Unfortunately, none of these so-called closed tests exhibits an optimality criterion; therefore
we choose one which at least in simulation studies showed the best behavior in terms of
average sample number statistics: The method of Spicer (1962) is based on a gambler's
ruin scenario in which both players have a fixed fortune and decide to play for $n$ games.
If $f(n, \pi, F_a, F_b)$ is the probability that a player with fortune $F_a$ and stake $b$ will ruin his
opponent with fortune $F_b$ in exactly $n$ games, then the following recurrence holds:

$$f(n, \pi, F_a, F_b) = \begin{cases}
0 & \text{if } F_a < 0 \lor (n = 0 \land F_b > 0), \\
1 & \text{if } n = 0 \land F_a > 0 \land F_b < 0, \\
\pi f(n-1, \pi, F_a + 1, F_b - b) + (1 - \pi) f(n-1, \pi, F_a - b, F_b + b) & \text{otherwise.}
\end{cases}$$

In each step, the player can either win a game with probability $\pi$ and win 1 from his
opponent or lose the stake $b$ to the other player. Now, given $n = x + y$ games of which
player A has won $y$ and player B has won $x$, the game will stop if either of the following
conditions holds:

$$y - bx = -F_a \iff y = \frac{b}{1+b}\,n - \frac{F_a}{1+b} \quad \text{or} \quad y - bx = F_b \iff y = \frac{b}{1+b}\,n + \frac{F_b}{1+b}.$$

This formulation casts the gambler's ruin problem into a Wald-like scheme, where we just
observe the cumulative wins of player A and check whether we reached the lower or upper
line. If we now choose $F_a$ and $F_b$ such that $f(n, 0.5, F_a, F_b) \le \alpha_l$, we construct a test
which allows us to check whether a given configuration performs worse than $\pi = 0.5$ (i.e.,
crosses the lower line) and can therefore be flagged as an overall loser with controlled error
probability of $\alpha_l$ (see Alling (1966)). For more details on the closed design of Spicer please
consult Spicer (1962).







Figure 19: Relative speed gain of fast cross-validation compared to full cross-validation. We
assume that training time is cubic in the number of samples. Shown are runtimes
for 10-fold cross-validation on different problem classes by different loser/winner
ratios (easy: 3:1; medium: 1:1; hard: 1:3) over 200 resamples.



Since simulation studies show that the closed variants of sequential testing exhibit
low average sample number statistics, we first have a look at the runtime performance of
the CVST algorithm equipped with either the open or the closed sequential test. The most
influential parameter in terms of runtime is the number of steps $S$. In principle, a larger
number of steps leads to more robust estimates, but also to an increase in computation time.
We study the effect of different choices of this parameter in a simulation. For the sake of
simplicity we assume that the binary top or flop scheme consists of independent Bernoulli
variables with $\pi_{\text{winner}} \in [0.9, 1.0]$ and $\pi_{\text{loser}} \in [0.0, 0.1]$. We test both the open and the
closed sequential test and measure the relative speed-up of the CVST algorithm over
a full 10-fold cross-validation for a learner with cubic training time.

Figure 19 shows the resulting simulated runtimes for different settings. The overall 
speed-up is much higher for the closed sequential test indicating a more aggressive behavior 
compared to the more conservative open alternative. Both tests show their highest increase 
in the range of 10 to 20 steps with a rapid decline towards the higher step numbers. So in 
terms of speed the closed sequential test definitely beats the more conservative open test. 

To evaluate the false negatives of the closed sequential test we simulate switching
configurations by independent Bernoulli variables which change their success probability $\pi$ from
a chosen $\pi_{\text{before}} \in \{0.1, 0.2, \ldots, 0.5\}$ to a constant 1.0 at a given change point. With
this setup we mimic the behavior of a switching configuration which starts out as a loser
and, once enough data is available, turns into a constant winner. The results can be seen in
Figure 20, which reveals that the speed gain comes at a price: Apart from offering no control
over the safety zone, the closed test falsely drops far more configurations than
the open sequential test (see Figure 21 in Appendix D). While having a definite
advantage over the open test in terms of speed, the false negative rate of the closed test
renders it useless for the CVST algorithm.

Figure 20: False negatives generated with the closed sequential test for non-stationary
configurations, i.e., at the given change point the Bernoulli variable changes its
$\pi_{\text{before}}$ from the indicated value to 1.0.



7. Conclusion 

We presented a method to speed up the cross-validation procedure by starting at subsets
of the full training set size, identifying clearly underperforming parameter configurations
early on and focusing on the most promising candidates for the larger subset sizes. We have
discussed that taking subsets of the data set has theoretical advantages when compared
to other heuristics like local search on the parameter set because the effects on the test
errors are systematic and can be understood statistically. On the one hand, we showed that
the optimal configurations converge to the true ones as sample sizes tend to infinity; on the
other hand, we discussed in a concrete setting how the different behaviors of estimation error
and approximation error lead to much faster convergence in practice. These insights led to the
introduction of a safety zone through sequential testing, which ensures that underperforming
configurations are not removed prematurely when the minima have not yet converged. In
experiments we showed that our procedure leads to a speed-up by a factor of up to 120 compared
to the full cross-validation without a significant increase in prediction error.

It will be interesting to combine this method with other procedures like the Hoeffding
races or algorithms for multi-armed bandit problems. Furthermore, getting accurate
convergence bounds even for finite sample size settings is another topic for future research.

Appendix A. Examples for learning algorithms for which condition (2) holds

In Section 2, one condition was that for a fixed parameter configuration c, the expected risk 
of the learned predictor converges as the number of samples goes to infinity. In particular, 
we need to ensure that the test error for a fixed configuration converges. In this section, we 
discuss some examples. We refer to the book by Devroye et al. (1996) for the theoretical 
results. Please consult the book for the original publications. 

The condition is closely related to the theory of uniform convergence in the empirical
risk minimization framework. In this approach, an algorithm is interpreted as choosing the
solution $g_n$ with the best error on the training set, $R_n(g) = \frac{1}{n}\sum_{i=1}^{n} \ell(g(X_i), Y_i)$, from some
hypothesis class $\mathcal{G}$. If the VC-dimension of $\mathcal{G}$, which roughly measures the complexity of $\mathcal{G}$,
is finite then it holds that

$$R_n(g) \to R(g)$$

uniformly over $g \in \mathcal{G}$, and consequently also $R(g_n) \to \inf_{g \in \mathcal{G}} R(g)$.

Now in order to make the link to our condition (2), we need that each parameter $c$
corresponds to a fixed hypothesis class $\mathcal{G}_c$ (which must not depend on the sample size in some
way). For feed-forward neural networks, one can show, for example, that neural networks
with one hidden layer with $k$ inner nodes and sigmoid activation function have finite
VC-dimension (Devroye et al., 1996, Theorem 30.6).

For kernel machines, we consider the reproducing kernel Hilbert space (RKHS,
Aronszajn (1950)) view: Let $\mathcal{H}_k$ be the RKHS induced by a Mercer kernel $k$ with norm $\|\cdot\|_{\mathcal{H}_k}$.
Evgeniou and Pontil (1999) show that the $V_\gamma$-dimension of the hypothesis class $\mathcal{G}(A) =
\{f \in \mathcal{H}_k \mid \|f\|_{\mathcal{H}_k} \le A\}$ is finite, from which uniform convergence of the kind described above
follows, and thus also that our condition (2) holds.

Many kernel methods, including kernel ridge regression and support vector machines,
can be written as regularized optimization problems in the RKHS of the form:

$$\min_{f \in \mathcal{H}_k} \; R_n(f) + C \|f\|^2_{\mathcal{H}_k}.$$

Now if we assume that $\ell(f(x), y)$ is bounded by $B$ and continuous in $f$, it follows that the
minimum is attained for some $f$ with $\|f\|^2_{\mathcal{H}_k} \le B/C$: For $\|f\|_{\mathcal{H}_k} = 0$, $R_n(f) + C\|f\|^2_{\mathcal{H}_k} \le B$,
and for $\|f\|^2_{\mathcal{H}_k} > B/C$, $R_n(f) + C\|f\|^2_{\mathcal{H}_k} > B$. Because $R(f) + C\|f\|^2_{\mathcal{H}_k}$ is continuous in $f$,
it follows that the minimum is somewhere in-between.

Now $R_n(f)$ converges to $R(f)$ uniformly over $f \in \mathcal{G}(B/C)$, such that there exists an
$A \le B/C$ such that

and we see that a regularization constant $C$ corresponds to a fixed hypothesis class $\mathcal{G}(A)$
and condition (2) holds again.

As a direct consequence of this discussion we have to take care of the correct scaling
of the regularization constants during the CVST run. Thus, for kernel ridge regression we
have to scale the $\lambda$ parameter linearly with the data set size and for the SVR divide the $C$
parameter accordingly.



Appendix B. Proof of Theorem 1 

Proof Recall (see Equation 1) that $e_n(c)$ is the expected error of parameter configuration $c$.
We first prove that we have uniform convergence over finite sets of candidate configurations
if the error for individual configurations converges. Let $\varepsilon > 0$, then

$$P\Big\{\max_{c \in C} |e_n(c) - e(c)| > \varepsilon\Big\} = P\Big(\bigcup_{c \in C} \big\{|e_n(c) - e(c)| > \varepsilon\big\}\Big) \le \sum_{c \in C} P\big\{|e_n(c) - e(c)| > \varepsilon\big\} \to 0$$

since for each fixed $c$, $e_n(c) \to e(c)$ in probability.

For the proof of the second statement (convergence of the minimum), let $\varepsilon = \max_{c \in C} |e_n(c) -
e(c)|$, then

$$\begin{aligned}
e(c_n^*) - e(c^*) &= e(c_n^*) - e_n(c_n^*) + e_n(c_n^*) - e(c^*) \\
&\le \varepsilon + e_n(c_n^*) - e(c^*) \\
&\le \varepsilon + e_n(c^*) - e(c^*) \\
&\le \varepsilon + \varepsilon = 2\varepsilon,
\end{aligned}$$

where the first and third inequality hold because of the uniform bound on the error, and
the second one because $c_n^*$ is the minimizer of $e_n$.

Convergence in probability follows because $P\{e(c_n^*) - e(c^*) > \varepsilon\} \le P\{\max_{c \in C} |e_n(c) -
e(c)| > \frac{\varepsilon}{2}\} \to 0$.

Finally, for the third statement (convergence for subsets), we start by using the same
argument as for the second statement,

$$\begin{aligned}
e_n(c_m^*) - e_n(c_n^*) &\le 2 \max_{c \in C} |e_m(c) - e_n(c)| \\
&\le 2 \max_{c \in C} |e_m(c) - e(c)| + 2 \max_{c \in C} |e(c) - e_n(c)| \\
&=: 2m_m + 2m_n.
\end{aligned}$$

Now fix some $\delta, \varepsilon > 0$. First of all, note that

$$2m_m + 2m_n > \varepsilon \quad \text{implies} \quad 2m_m > \frac{\varepsilon}{2} \ \lor\ 2m_n > \frac{\varepsilon}{2}.$$

Then,

$$P\{2m_m + 2m_n > \varepsilon\} \le P\{m_m > \varepsilon/4\} + P\{m_n > \varepsilon/4\}.$$

As already shown, these probabilities converge to zero, such that there exists an $l$ such that
for all $n, m \ge l$,

$$P\{m_m > \varepsilon/4\} \le \frac{\delta}{2}, \quad \text{and} \quad P\{m_n > \varepsilon/4\} \le \frac{\delta}{2}.$$

Therefore,

$$P\{e_n(c_m^*) - e_n(c_n^*) > \varepsilon\} \le P\{m_m > \varepsilon/4\} + P\{m_n > \varepsilon/4\} \le \delta,$$

for all $m, n \ge l$, in particular for $l \le m \le n$. ∎



Appendix C. Proof of Safety Zone Bound 

In this section we prove the safety zone bound of Section 4.1 of the paper. We will follow
the notation and treatment of the sequential analysis as found in the original publication
of Wald (1947), Sections 5.3 to 5.5. First of all, Wald proves in Equation 5:27 that the
following approximation holds:

$$\mathrm{ASN}(\pi_0, \pi_1 \mid \pi = 1.0) \approx \frac{\log\frac{1-\beta_l}{\alpha_l}}{\log\frac{\pi_1}{\pi_0}}.$$

The minimal $\mathrm{ASN}(\pi_0, \pi_1 \mid \pi = 1.0)$ is therefore attained if $\log\frac{\pi_1}{\pi_0}$ is maximal, which is clearly
the case for $\pi_1 = 1.0$ and $\pi_0 = 0.5$, which holds by construction. So we get the lower bound
of $S$ for given significance levels $\alpha_l, \beta_l$:

$$S \ge \log\frac{1-\beta_l}{\alpha_l} \Big/ \log 2.$$

The lower line $L_0$ of the graphical sequential analysis test as exemplified in Figure 3 of the
paper is defined as follows (see Equations 5:13 to 5:15):

$$L_0 = \frac{\log\frac{\beta_l}{1-\alpha_l}}{\log\frac{\pi_1}{\pi_0} - \log\frac{1-\pi_1}{1-\pi_0}} + n\,\frac{\log\frac{1-\pi_0}{1-\pi_1}}{\log\frac{\pi_1}{\pi_0} - \log\frac{1-\pi_1}{1-\pi_0}}.$$

Setting $L_0 = 0$, we can get the intersection of the lower test line with the x-axis and therefore
the earliest step $s_{\text{safe}}$ in which the procedure will drop a constant loser configuration. This
yields

$$s_{\text{safe}} = -\frac{\log\frac{\beta_l}{1-\alpha_l}}{\log\frac{\pi_1}{\pi_0} - \log\frac{1-\pi_1}{1-\pi_0}} \Bigg/ \frac{\log\frac{1-\pi_0}{1-\pi_1}}{\log\frac{\pi_1}{\pi_0} - \log\frac{1-\pi_1}{1-\pi_0}} = \frac{\log\frac{\beta_l}{1-\alpha_l}}{\log\frac{1-\pi_1}{1-\pi_0}} = \frac{\log\frac{\beta_l}{1-\alpha_l}}{\log\left(2 - \sqrt[S]{\frac{1-\beta_l}{\alpha_l}}\right)}.$$

The last equality can be derived by inserting the closed form of $\pi_1$ given $\pi_0 = 0.5$:

$$S = \mathrm{ASN}(\pi_0, \pi_1 \mid \pi = 1.0) = \frac{\log\frac{1-\beta_l}{\alpha_l}}{\log\frac{\pi_1}{\pi_0}} = \frac{\log\frac{1-\beta_l}{\alpha_l}}{\log 2\pi_1} \iff \pi_1 = \frac{1}{2}\sqrt[S]{\frac{1-\beta_l}{\alpha_l}}.$$

Setting $s_{\text{safe}}$ in relation to the maximal number of steps $S$ yields the safety zone bound
of Section 4.1.

Figure 21: False negatives generated with the open sequential test for non-stationary
configurations, i.e., at the given change point the Bernoulli variable changes its
$\pi_{\text{before}}$ from the indicated value to 1.0.
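
The bound can be evaluated numerically. The sketch below assumes $\pi_0 = 0.5$, computes $\pi_1 = \frac{1}{2}\sqrt[S]{(1-\beta_l)/\alpha_l}$, and then returns $s_{\text{safe}} = \log\frac{\beta_l}{1-\alpha_l} / \log(2(1-\pi_1))$; the function names are ours.

```python
import math

def pi1_for_steps(S, alpha_l, beta_l):
    """pi_1 chosen such that a constant winner is accepted after S steps
    (pi_0 is fixed to 0.5)."""
    return 0.5 * ((1.0 - beta_l) / alpha_l) ** (1.0 / S)

def safe_steps(S, alpha_l, beta_l):
    """Earliest step at which a constant loser crosses the lower line."""
    pi1 = pi1_for_steps(S, alpha_l, beta_l)
    if pi1 >= 1.0:
        # Happens when S is below log((1 - beta_l) / alpha_l) / log(2).
        raise ValueError("S is too small for the given significance levels")
    return math.log(beta_l / (1.0 - alpha_l)) / math.log(2.0 * (1.0 - pi1))

# Relative size of the safety zone for alpha_l = 0.01 and beta_l = 0.1.
zone_10 = safe_steps(10, 0.01, 0.1) / 10
zone_20 = safe_steps(20, 0.01, 0.1) / 20
```

For $\alpha_l = 0.01$ and $\beta_l = 0.1$ this gives a relative safety zone of about 0.27 for $S = 10$ and about 0.39 for $S = 20$.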



Appendix D. False Negative Rate for Underestimated Change Point 

In Section 4.1 we have investigated how the CVST algorithm performs if the experimenter
was able to ensure a stable regime via the safety zone. Now, we go a step further
and look at the performance if the experimenter underestimated the change point
$s_{cp}$. To get insight into the dropping rate we simulate those switching configurations by
independent Bernoulli variables which change their success probability $\pi$ from a chosen
$\pi_{\text{before}} \in \{0.1, 0.2, \ldots, 0.5\}$ to a constant 1.0 at a given change point. This setup
imitates the behavior of a switching configuration which starts out as a loser (i.e., up
to the change point the trace will consist more or less of zeros) and, once enough data is
available, turns into a constant winner.

The relative loss of these configurations for 10 and 20 steps is plotted in Figure 21 for
different change points. The figure confirms our theoretical findings of Lemma 2 by showing
the corresponding safety zone for the specific parameter settings: For instance, for $\alpha_l =
0.01$, $\beta_l = 0.1$ and $S = 10$ steps, the safety zone amounts to $0.27 \times 10$, meaning that
if the change point for all switching configurations occurs at step one or two, the CVST
algorithm would not suffer from false positives. Similarly, for $S = 20$ the safety zone is
$0.39 \times 20 = 7.8$. These theoretical results are confirmed in our simulation study, where
the false negative rate is zero for sufficiently small change points for the open variant
of the test. After that, there are increasing probabilities that the configuration will be
removed. Depending on the success probability of the configuration before the change
point, the resulting false negative rate ranges from mild for $\pi = 0.5$ to relatively severe for
$\pi = 0.1$. The later the change point occurs, the higher the resulting false negative rate
will be. Interestingly, if we increase the total number of steps from 10 to 20, the absolute
values of the false negative rates are significantly lower. So even when the experimenter
underestimates the actual change point, the CVST algorithm has some extra room which
can even be extended by increasing the total number of steps.
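
A simulation of this kind can be sketched as follows. This is a simplified stand-in for the study reported here, not the original code: only the lower (loser) boundary of the open test is monitored, $\pi_0 = 0.5$ and $\pi_1 = \frac{1}{2}\sqrt[S]{(1-\beta_l)/\alpha_l}$ are assumed, and the function names, the number of trials and the seed are arbitrary choices.

```python
import math
import random

def is_dropped(trace, pi1, alpha_l, beta_l, pi0=0.5):
    """True if the cumulative log-likelihood ratio of the 0/1 trace
    falls below the lower boundary log(beta_l / (1 - alpha_l))."""
    lower = math.log(beta_l / (1.0 - alpha_l))
    step_up = math.log(pi1 / pi0)
    step_down = math.log((1.0 - pi1) / (1.0 - pi0))
    total = 0.0
    for b in trace:
        total += step_up if b else step_down
        if total < lower:
            return True
    return False

def false_negative_rate(S, change_point, pi_before,
                        alpha_l=0.01, beta_l=0.1, trials=2000, seed=0):
    """Fraction of switching configurations that are dropped although
    they become constant winners after the change point."""
    rng = random.Random(seed)
    pi1 = 0.5 * ((1.0 - beta_l) / alpha_l) ** (1.0 / S)
    drops = 0
    for _ in range(trials):
        # Bernoulli(pi_before) up to the change point, constant 1 afterwards.
        trace = [1 if step > change_point or rng.random() < pi_before else 0
                 for step in range(1, S + 1)]
        drops += is_dropped(trace, pi1, alpha_l, beta_l)
    return drops / trials

inside_zone = false_negative_rate(S=10, change_point=2, pi_before=0.1)
outside_zone = false_negative_rate(S=10, change_point=6, pi_before=0.1)
```

A change point inside the safety zone cannot lead to a drop, because at most two zeros are observed before the trace turns into constant ones; a change point well beyond the zone with a low success probability leads to frequent drops.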



Appendix E. Proof of Computational Budget 

For the size $N$ of the whole data set and a cubic learner, resulting in a learning time of
$t = N^3$, one observes that learning on a proportion of size $\frac{i}{S}N$ takes about $\frac{i^3}{S^3}t$ time. By
construction one has to learn on all $K$ parameter configurations in each step before hitting
$s_r \times S$ and on $K \times (1 - r)$ parameter configurations with drop ratio $r$ afterwards. Thus the
total computation time needed is given by

$$t \times K(1-r) \sum_{i=1}^{s_r \times S} \frac{i^3}{S^3} + t \times K \times r \sum_{i=1}^{S} \frac{i^3}{S^3},$$

which should be smaller than the given time budget $T$.

Making use of the equality $\sum_{i=1}^{n} i^3 = \frac{n^2(n+1)^2}{4}$, one can reformulate the inequality:

$$T \ge t \times K(1-r)\,\frac{s_r^2 S^2 (s_r \times S + 1)^2}{4 S^3} + t \times K \times r\,\frac{S^2 (S+1)^2}{4 S^3}
= \frac{t \times K}{4}\left[(1-r)\,\frac{s_r^2 (s_r \times S + 1)^2}{S} + r\,\frac{(S+1)^2}{S}\right].$$

It is obvious that this inequality is quadratic in the variable $S$, which can be solved by
bringing the above inequality into standard form:

$$\begin{aligned}
\frac{4T}{t \times K} &\ge (1-r)\,\frac{s_r^2 (s_r \times S + 1)^2}{S} + r\,\frac{(S+1)^2}{S} \\
\frac{4T \times S}{t \times K} &\ge \big((1-r)s_r^4 + r\big) S^2 + 2\big((1-r)s_r^3 + r\big) S + (1-r)s_r^2 + r \\
0 &\ge S^2 + 2\,\frac{t \times K(1-r)s_r^3 + t \times K \times r - 2T}{\big((1-r)s_r^4 + r\big)\, t \times K}\, S + \frac{(1-r)s_r^2 + r}{(1-r)s_r^4 + r}.
\end{aligned}$$

Substituting $a = \frac{t \times K(1-r)s_r^3 + t \times K \times r - 2T}{((1-r)s_r^4 + r)\, t \times K}$ and $b = \frac{(1-r)s_r^2 + r}{(1-r)s_r^4 + r}$, the above is equivalent to:

$$S = -a + y, \quad y \in \left[-\sqrt{a^2 - b},\ +\sqrt{a^2 - b}\right].$$

For the sake of a meaningful step amount, i.e. $S > 0$, and furthermore $S$ as large as possible,
we choose it as

$$S = -a + \sqrt{a^2 - b}.$$

Note that $S$ is a function of the parameter $s_r$. Since obviously $b > 0$ holds, $a$ must be
negative in order to gain a positive step amount. Furthermore the root has to be solvable.
So the following constraints on $s_r$ have to be made:

(1) $2T > t \times K(1-r)s_r^3 + t \times K \times r$

(2) $a^2 \ge b$.
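
The final formula can be checked numerically. The sketch below computes $S = -a + \sqrt{a^2 - b}$ and verifies that plugging the result back into the closed-form cost $\frac{tK}{4S}\big[(1-r)s_r^2(s_r S + 1)^2 + r(S+1)^2\big]$ recovers the budget $T$. The function names and all numeric values are hypothetical; the cost model is taken from the derivation as stated, with time measured in units where $t$ is the training time on the full data set.

```python
import math

def max_steps(T, t, K, r, s_r):
    """Largest S that satisfies the quadratic budget inequality."""
    d = (1.0 - r) * s_r ** 4 + r
    a = (t * K * (1.0 - r) * s_r ** 3 + t * K * r - 2.0 * T) / (d * t * K)
    b = ((1.0 - r) * s_r ** 2 + r) / d
    if a >= 0.0:
        raise ValueError("constraint (1) violated: budget T is too small")
    if a * a < b:
        raise ValueError("constraint (2) violated: a^2 < b")
    return -a + math.sqrt(a * a - b)

def total_time(S, t, K, r, s_r):
    """Closed-form computation time for S steps under the cubic cost model."""
    return t * K / (4.0 * S) * ((1.0 - r) * s_r ** 2 * (s_r * S + 1.0) ** 2
                                + r * (S + 1.0) ** 2)

# Hypothetical setting: 100 configurations, budget of 500 full-data trainings.
S = max_steps(T=500.0, t=1.0, K=100, r=0.5, s_r=0.3)
spent = total_time(S, t=1.0, K=100, r=0.5, s_r=0.3)
```

In practice the resulting value would be rounded down to an integer number of steps, which keeps the computation within the budget.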

Appendix F. Example Run of CVST Algorithm 

In this section we give an example of the whole CVST algorithm on one noisy sine data set
of $n = 1{,}000$ data points with intrinsic dimensionality of $d = 2$. The CVST algorithm is
executed with $S = 10$ and $w_{\text{stop}} = 4$. We use a $\nu$-SVM (Schölkopf et al., 2000) and test a
parameter grid of $\log_{10}(\sigma) \in \{-3, -2.9, \ldots, 3\}$ and $\nu \in \{0.05, 0.1, \ldots, 0.5\}$. The procedure
runs for 4 steps after which the early stopping rule takes effect. This yields the following
traces matrix (only remaining configurations are shown):









                               n = 90   n = 180   n = 270   n = 360
log10(σ) = −2.3, ν = 0.35        0         0         1         0
log10(σ) = −2.3, ν = 0.40        0         1         1         0
log10(σ) = −2.3, ν = 0.45        0         1         0         1
log10(σ) = −2.2, ν = 0.30        0         1         0         0
log10(σ) = −2.2, ν = 0.35        0         1         1         0
log10(σ) = −2.2, ν = 0.40        0         1         1         0
log10(σ) = −2.2, ν = 0.45        0         1         1         0
log10(σ) = −2.2, ν = 0.50        0         0         1         0
log10(σ) = −2.1, ν = 0.35        0         1         1         0
log10(σ) = −2.1, ν = 0.40        0         1         1         0
log10(σ) = −2.1, ν = 0.45        0         1         1         0
log10(σ) = −2.1, ν = 0.50        1         0         1         0
log10(σ) = −2.0, ν = 0.50        0         0         1         0





The corresponding mean square errors of the remaining configurations after each step
are shown in the next matrix. Based on these values, the winning configuration, namely
$\log_{10}(\sigma) = -2.1$, $\nu = 0.40$, is chosen:














                               n = 90   n = 180   n = 270   n = 360
log10(σ) = −2.3, ν = 0.35      0.0370   0.0199    0.0145    0.0150
log10(σ) = −2.3, ν = 0.40      0.0362   0.0197    0.0146    0.0146
log10(σ) = −2.3, ν = 0.45      0.0356   0.0197    0.0146    0.0144
log10(σ) = −2.2, ν = 0.30      0.0365   0.0195    0.0146    0.0148
log10(σ) = −2.2, ν = 0.35      0.0351   0.0193    0.0142    0.0145
log10(σ) = −2.2, ν = 0.40      0.0345   0.0194    0.0143    0.0141
log10(σ) = −2.2, ν = 0.45      0.0340   0.0193    0.0143    0.0140
log10(σ) = −2.2, ν = 0.50      0.0332   0.0200    0.0145    0.0138
log10(σ) = −2.1, ν = 0.35      0.0353   0.0194    0.0144    0.0142
log10(σ) = −2.1, ν = 0.40      0.0343   0.0195    0.0142    0.0138
log10(σ) = −2.1, ν = 0.45      0.0340   0.0197    0.0140    0.0138
log10(σ) = −2.1, ν = 0.50      0.0329   0.0199    0.0142    0.0137
log10(σ) = −2.0, ν = 0.50      0.0351   0.0204    0.0145    0.0137



Appendix G. Non-Parametric Tests 

The tests used in the CVST algorithm are common tools in the field of statistical data 
analysis. Here we give a short summary based on Heckert and Filliben (2003) and cast 
the notation into the CVST framework context. Both methods deal with the performance 
matrix of K configurations with performance values on r data points: 



                          Data Points
Configuration      1        2       ...      r
      1          x_11     x_12      ...    x_1r
      2          x_21     x_22      ...    x_2r
      3          x_31     x_32      ...    x_3r
     ...
      K          x_K1     x_K2      ...    x_Kr



Both tests address a similar question ("Do the $K$ configurations have identical effects?")
but are designed for different kinds of data: Cochran's Q test is tuned for binary $x_{ij}$, while
the Friedman test acts on continuous values. In the context of the CVST algorithm, the
tests are used for two different tasks:

1. Determine whether a set of configurations are the top performing ones (see the
overview in Figure 3 and the function topConfigurations in Algorithm 1).

2. Check whether the remaining configurations behaved similarly in the past (see the
overview in Figure 3 and the function similarPerformance in Algorithm 1).

In both cases, the configurations are compared either by their performance on the samples
(Point 1 above) or on the last $w_{\text{stop}}$ traces (Point 2 above) of the remaining configurations.
Depending on the learning problem, either the Friedman test (for regression tasks) or
Cochran's Q test (for classification tasks) is used in Point 1.
In both cases the hypotheses for the tests are as follows:

• $H_0$: All configurations are equally effective (no effect)

• $H_1$: There is a difference in the effectiveness among the configurations, i.e., there is
at least one configuration showing a significantly different effect on the data points.



G.1 Cochran's Q Test

The test statistic $T$ is calculated as follows:

$$T = K(K-1) \, \frac{\sum_{i=1}^{K} \left( R_i - \frac{M}{K} \right)^2}{\sum_{i=1}^{r} C_i (K - C_i)}$$

with $R_i$ denoting the row total for the $i$th configuration, $C_i$ the column total for the $i$th data
point, and $M$ the grand total. We reject $H_0$ if $T > \chi^2(1-\alpha, K-1)$, with $\chi^2(1-\alpha, K-1)$
denoting the $(1-\alpha)$-quantile of the $\chi^2$ distribution with $K-1$ degrees of freedom and $\alpha$
being the significance level. As Cochran (1950) points out, the $\chi^2$ approximation breaks down
for small tables. Tate and Brown (1970) state that as long as the table contains at least 24
entries, the $\chi^2$ approximation will suffice; otherwise the exact distribution should be used,
which can either be calculated explicitly (see Patil, 1975) or determined via permutation.
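The statistic and the permutation alternative mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the function names `cochran_q` and `permutation_p_value`, the example table, the Monte Carlo approximation of the permutation distribution, and the convention of returning 0 when no data point discriminates between configurations are all assumptions made for the example.

```python
import random


def cochran_q(x):
    """Cochran's Q statistic T for a binary table x[k][i]
    (configuration k, data point i)."""
    K, r = len(x), len(x[0])
    row_totals = [sum(row) for row in x]                              # R_k
    col_totals = [sum(x[k][i] for k in range(K)) for i in range(r)]   # C_i
    M = sum(row_totals)                                               # grand total
    denom = sum(c * (K - c) for c in col_totals)
    if denom == 0:
        # Assumption: every data point is all-0 or all-1, so there is
        # no evidence of a difference; report T = 0.
        return 0.0
    return K * (K - 1) * sum((R - M / K) ** 2 for R in row_totals) / denom


def permutation_p_value(x, n_perm=2000, seed=0):
    """Monte Carlo approximation of the permutation p-value: under H0 the
    K outcomes on each data point are exchangeable across configurations."""
    rng = random.Random(seed)
    K, r = len(x), len(x[0])
    observed = cochran_q(x)
    hits = 0
    for _ in range(n_perm):
        columns = []
        for i in range(r):
            col = [x[k][i] for k in range(K)]
            rng.shuffle(col)  # permute outcomes within one data point
            columns.append(col)
        permuted = [[columns[i][k] for i in range(r)] for k in range(K)]
        if cochran_q(permuted) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical table: K = 3 configurations, r = 8 data points (24 entries).
example = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
]
T = cochran_q(example)  # T = 9.0 for this table
# The 0.95-quantile of the chi-squared distribution with 2 degrees of
# freedom is about 5.991, so H0 is rejected at alpha = 0.05.
print(T, T > 5.991)
print(permutation_p_value(example))
```

With 24 entries the example table sits right at the threshold quoted from Tate and Brown (1970), which is why both the $\chi^2$ comparison and the permutation estimate are shown side by side.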



G.2 Friedman Test 

Let $R(x_{ki})$ be the rank assigned to $x_{ki}$ within data point $i$ (i.e., the rank of configuration $k$ on
data point $i$). Average ranks are used in the case of ties. The ranks for configuration $k$
are summed up over the data points to obtain

$$R_k = \sum_{i=1}^{r} R(x_{ki}).$$

The test statistic $T$ is then calculated as follows:

$$T = \frac{12}{rK(K+1)} \sum_{k=1}^{K} \left( R_k - \frac{r(K+1)}{2} \right)^2.$$

If there are ties, then

$$T = \frac{(K-1) \sum_{k=1}^{K} \left( R_k - r(K+1)/2 \right)^2}{\left[ \sum_{k=1}^{K} \sum_{i=1}^{r} R(x_{ki})^2 \right] - \left[ rK(K+1)^2 \right]/4}.$$

We reject $H_0$ if $T > \chi^2(1-\alpha, K-1)$, with $\chi^2(1-\alpha, K-1)$ denoting the $(1-\alpha)$-quantile of the $\chi^2$
distribution with $K-1$ degrees of freedom and $\alpha$ being the significance level.
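As a rough illustration of the Friedman computation, the following Python sketch ranks the configurations within each data point and evaluates the tie-corrected statistic, which reduces to the simpler form when no ties occur. The helper names `average_ranks` and `friedman_statistic`, the error values, and the convention of returning 0 when all ranks are tied are assumptions made for this example and do not come from the paper.

```python
def average_ranks(values):
    """Ranks starting at 1 (1 = smallest value); tied values receive the
    average of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for j in order[pos:end + 1]:
            ranks[j] = avg
        pos = end + 1
    return ranks


def friedman_statistic(x):
    """Tie-corrected Friedman statistic T for x[k][i]
    (configuration k, data point i)."""
    K, r = len(x), len(x[0])
    ranks = [[0.0] * r for _ in range(K)]
    for i in range(r):
        # Rank the K configurations within data point i.
        col_ranks = average_ranks([x[k][i] for k in range(K)])
        for k in range(K):
            ranks[k][i] = col_ranks[k]
    rank_sums = [sum(row) for row in ranks]                        # R_k
    spread = sum((R - r * (K + 1) / 2) ** 2 for R in rank_sums)
    sum_sq = sum(v * v for row in ranks for v in row)              # sum of squared ranks
    correction = r * K * (K + 1) ** 2 / 4
    if sum_sq == correction:
        # Assumption: all configurations tie on every data point; report T = 0.
        return 0.0
    return (K - 1) * spread / (sum_sq - correction)


# Hypothetical pointwise errors: K = 3 configurations, r = 4 data points.
errors = [
    [0.10, 0.20, 0.15, 0.12],
    [0.30, 0.25, 0.35, 0.28],
    [0.20, 0.30, 0.25, 0.22],
]
T = friedman_statistic(errors)  # rank sums 4, 11, 9 give T = 6.5
# The 0.95-quantile of the chi-squared distribution with 2 degrees of
# freedom is about 5.991, so H0 would be rejected at alpha = 0.05.
print(T, T > 5.991)
```

With only four data points the $\chi^2$ approximation is crude, so the example is meant to show the arithmetic and not a realistic sample size.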






